MiniRobotLanguage (MRL)
AIG.AskVision
Analyze Images, Audio, or Documents with Google Gemini
Intention
AskVision: The Multimodal Interface
The AskVision Command extends the capabilities of the standard Ask command by allowing you to send binary media data along with your text prompt. This enables "Multimodal" interactions where the AI can see, hear, or read documents.
You can use this command to:
• Analyze images (OCR, object detection, description).
• Transcribe and summarize audio files.
• Extract information from PDF documents.
This command constructs a complex JSON payload that includes both the user's text prompt and the media content.
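The shape of such a payload can be sketched in Python. This is only an illustration, assuming the request follows the Gemini REST `generateContent` body layout with a text part and an `inline_data` part; the exact JSON that AIG.AskVision emits is not documented here, and `build_vision_payload` is a hypothetical helper name.

```python
import base64
import json


def build_vision_payload(prompt: str, media_b64: str, mime_type: str) -> str:
    """Return a JSON request body with one text part and one inline media part."""
    body = {
        "contents": [
            {
                "parts": [
                    # P1: the user's text prompt
                    {"text": prompt},
                    # P2 + P3: Base64 media data and its MIME type
                    {"inline_data": {"mime_type": mime_type, "data": media_b64}},
                ]
            }
        ]
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Placeholder bytes stand in for real image data.
    fake_image = base64.b64encode(b"placeholder bytes").decode("ascii")
    print(build_vision_payload("Describe this image.", fake_image, "image/png"))
```

The prompt and the media travel in the same `parts` list, which is why the MIME type has to match the data that is actually sent.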
The command features a Hybrid Input System for the media parameter (P2):
1. File Path: If you provide a path to a valid file on your disk (e.g., "C:\Images\photo.jpg"), the command automatically reads the file and converts it to the required Base64 format before sending.
2. Base64 String: If you provide raw Base64 data (e.g., from the clipboard or memory), the command sends it directly.
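The two input modes can be pictured with the following Python sketch. It assumes the decision is simply "does P2 name an existing file?"; the actual detection rule used by the robot is not stated here, and `resolve_media` is a hypothetical name.

```python
import base64
import os
import tempfile


def resolve_media(p2: str) -> str:
    """Return Base64 text for P2: encode the file if P2 is an existing path, else pass it through."""
    if os.path.isfile(p2):
        with open(p2, "rb") as handle:
            return base64.b64encode(handle.read()).decode("ascii")
    # Not a file on disk: assume the caller already supplied Base64 data.
    return p2


if __name__ == "__main__":
    # Write three bytes to a temporary file and resolve it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
        tmp.write(b"ABC")
    print(resolve_media(tmp.name))   # Base64 of the file content
    print(resolve_media("QUJD"))     # already Base64, returned unchanged
    os.remove(tmp.name)
```

A consequence of this kind of rule is that a mistyped file path would be treated as Base64 data and sent as-is, so it is worth checking the path before calling the command.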
The result is returned in the specified variable (P4) or on the Top of Stack (TOS), just like the standard `AIG.Ask` command.
Typical prompts by media type:
• Image Analysis: "What is in this picture?", "Extract the text from this invoice".
• Audio Processing: "Summarize this meeting recording", "Transcribe this voice note".
• Document Understanding: "Summarize this PDF contract".
1. Select a Model: You must use a model that supports multimodal input. Use AIG.SetModel to select models like `gemini-1.5-flash`, `gemini-1.5-pro`, or `gemini-2.0-flash`. Older text-only models (like `gemini-1.0-pro`) will return an error.
2. Prepare the Data: Locate your file or have the Base64 string ready.
3. Call Command: Supply the Prompt (P1), File/Data (P2), and the correct MIME type (P3).
' --- Scenario 1: Analyzing an Image from Disk ---
AIG.SetModel|gemini-1.5-flash
' The command automatically reads and encodes the file
$$File = "C:\MyDocuments\Invoice_Scan.jpg"
AIG.AskVision|Extract the total amount from this invoice.|$$File|image/jpeg|$$Result
DBP. $$Result
' --- Scenario 2: Analyzing an Audio File ---
$$AudioFile = "C:\Recordings\meeting.mp3"
AIG.AskVision|Create a bullet-point summary of this conversation.|$$AudioFile|audio/mp3|$$Summary
DBP. $$Summary
' --- Scenario 3: Using Memory/Clipboard Data ---
' Assume $$Base64 contains raw image data from clipboard
AIG.AskVision|Describe this image.|$$Base64|image/png|$$Description
DBP. $$Description
Syntax
AIG.AskVision|P1|P2|P3[|P4]
Parameter Explanation
P1 - (Required) Prompt Text. The question or instruction regarding the media (e.g., "What is this?").
P2 - (Required) File Path OR Base64 Data.
• If you pass a valid file path (e.g., "C:\image.png"), the robot automatically loads and encodes it.
• If you pass a long text string (Base64), it is used directly.
P3 - (Required) MIME Type. Defines the format of P2. Common values:
• Images: `image/png`, `image/jpeg`, `image/webp`
• Audio: `audio/mp3`, `audio/wav`, `audio/aac`
• Documents: `application/pdf`
• Video: `video/mp4`
P4 - (Optional) Variable that receives the text result. If omitted, the result is placed on the Top of Stack (TOS).
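Because P3 must match the data in P2, it can help to derive the MIME type from the file extension. The Python sketch below uses only the MIME values listed under P3. The extension-to-type mapping and the name `mime_for` are assumptions for illustration, not part of the command.

```python
import os

# MIME values taken from the P3 list above; the extension keys are assumed.
MIME_BY_EXTENSION = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".aac": "audio/aac",
    ".pdf": "application/pdf",
    ".mp4": "video/mp4",
}


def mime_for(path: str) -> str:
    """Return the MIME type for a file name, or raise ValueError if it is not listed."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in MIME_BY_EXTENSION:
        raise ValueError(f"No MIME type listed for extension {extension!r}")
    return MIME_BY_EXTENSION[extension]


if __name__ == "__main__":
    print(mime_for("C:\\MyDocuments\\Invoice_Scan.jpg"))
    print(mime_for("C:\\Recordings\\meeting.mp3"))
```

Raising an error for unknown extensions is a deliberate choice here: guessing a MIME type would only move the failure to the API call.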
Remarks
- Requires an internet connection.
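- The input file size is limited by the Google API, typically to about 20 MB for data sent directly in the request. Base64 encoding enlarges the data by roughly one third, so the file itself may need to be noticeably smaller. Larger files may require Google Cloud Storage URIs, which this command does not support.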
- Media input consumes significantly more tokens than text. An image is typically ~258 tokens.
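The size remark above can be checked before calling the command. The Python sketch below estimates the Base64 size of a file and compares it with a 20 MB limit. Both the exact limit value and the assumption that it applies to the encoded data are taken from the remark, not from a verified specification; the function names are hypothetical.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # assumed from "typically up to 20MB"


def base64_length(raw_size: int) -> int:
    """Return the length of the padded Base64 text for raw_size bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_size + 2) // 3)


def fits_inline(raw_size: int, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """Return True if the Base64-encoded data stays within the limit."""
    return base64_length(raw_size) <= limit


if __name__ == "__main__":
    for megabytes in (5, 15, 16):
        size = megabytes * 1024 * 1024
        print(megabytes, "MB file ->", base64_length(size), "bytes encoded, fits:", fits_inline(size))
```

Under these assumptions a file of roughly 15 MB is already at the limit once encoded, and the prompt text in the same request takes up additional space.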
See also: