AIG. - AI Google Gemini Integration

AIG.AskVision


MiniRobotLanguage (MRL)

 

AIG.AskVision
Analyze Images, Audio, or Documents with Google Gemini

 

Intention

 

AskVision: The Multimodal Interface
 
The AskVision Command extends the capabilities of the standard Ask command by allowing you to send binary media data along with your text prompt. This enables "Multimodal" interactions where the AI can see, hear, or read documents.

You can use this command to:

Analyze images (OCR, object detection, description).

Transcribe and summarize audio files.

Extract information from PDF documents.

 

What is the AskVision Command?

 

This command constructs a complex JSON payload that includes both the user's text prompt and the media content.

It features a smart Hybrid Input System for the media parameter (P2):

1. File Path: If you provide a path to a valid file on your disk (e.g., "C:\Images\photo.jpg"), the command automatically reads the file and converts it to the required Base64 format before sending.

2. Base64 String: If you provide raw Base64 data (e.g., from the clipboard or memory), the command sends it directly.
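
The two input modes above can be pictured with a short Python sketch. This is not the robot's actual implementation, which this page does not document. The function name `resolve_media` and the rule "an existing file wins, anything else is passed through" are assumptions made only for illustration.

```python
import base64
import os
import tempfile


def resolve_media(p2: str) -> str:
    """Return Base64 text for P2, whether P2 is a file path or Base64 data.

    Hypothetical helper: the detection rule is an assumption, not documented behavior.
    """
    if os.path.isfile(p2):
        # Case 1: P2 names an existing file, so read it and encode it.
        with open(p2, "rb") as handle:
            return base64.b64encode(handle.read()).decode("ascii")
    # Case 2: P2 is treated as Base64 data and passed through unchanged.
    return p2


# Demonstration: a temporary file stands in for a real image on disk.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"hello")

from_path = resolve_media(tmp.name)    # file path input
from_data = resolve_media("aGVsbG8=")  # Base64 input
os.remove(tmp.name)

print(from_path == from_data)
```

Both calls end with the same Base64 text, which is why the caller does not have to care which form P2 takes.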

The result is returned in the specified variable (P4) or on the Top of Stack (TOS), just like the standard `AIG.Ask` command.
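
To give an idea of the JSON payload mentioned at the start of this section, the following Python sketch builds a request body in the shape used by Google's public `generateContent` REST interface and reads the answer text from a response. It is an assumption that the command sends exactly this structure. `build_payload`, `extract_text`, and the sample response are hypothetical and hand-written, and no network call is made.

```python
import json


def build_payload(prompt: str, media_b64: str, mime_type: str) -> dict:
    """Build a request body with one text part and one inline media part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": media_b64}},
                ]
            }
        ]
    }


def extract_text(response: dict) -> str:
    """Join the text parts of the first candidate in a response."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)


payload = build_payload("Describe this image.", "aGVsbG8=", "image/png")
print(json.dumps(payload, indent=2))

# Hand-written sample response for illustration, not real model output.
sample_response = {
    "candidates": [{"content": {"parts": [{"text": "A sample answer."}]}}]
}
print(extract_text(sample_response))
```

The extracted text corresponds to what the command hands back in P4 or on the TOS.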

 

Why Do You Need It?

 

Image Analysis: "What is in this picture?", "Extract the text from this invoice".

Audio Processing: "Summarize this meeting recording", "Transcribe this voice note".

Document Understanding: "Summarize this PDF contract".

 

How to Use the AskVision Command?

 

1. Select a Model: You must use a model that supports multimodal input. Use AIG.SetModel to select models like `gemini-1.5-flash`, `gemini-1.5-pro`, or `gemini-2.0-flash`. Older text-only models (like `gemini-1.0-pro`) will return an error.

2. Prepare the Data: Locate your file or have the Base64 string ready.

3. Call the Command: Supply the Prompt (P1), the File Path or Base64 Data (P2), and the correct MIME Type (P3). Optionally, name a variable for the result (P4).

 

Example Usage

 

' --- Scenario 1: Analyzing an Image from Disk ---

AIG.SetModel|gemini-1.5-flash

 

' The command automatically reads and encodes the file

$$File = "C:\MyDocuments\Invoice_Scan.jpg"

AIG.AskVision|Extract the total amount from this invoice.|$$File|image/jpeg|$$Result

DBP. $$Result

 

' --- Scenario 2: Analyzing an Audio File ---

$$AudioFile = "C:\Recordings\meeting.mp3"

AIG.AskVision|Create a bullet-point summary of this conversation.|$$AudioFile|audio/mp3|$$Summary

DBP. $$Summary

 

' --- Scenario 3: Using Memory/Clipboard Data ---

' Assume $$Base64 holds Base64-encoded image data (e.g., from the clipboard)

AIG.AskVision|Describe this image.|$$Base64|image/png|$$Description

DBP. $$Description

 

Syntax

AIG.AskVision|P1|P2|P3[|P4]

 

Parameter Explanation

 

P1 - (Required) Prompt Text. The question or instruction regarding the media (e.g., "What is this?").

P2 - (Required) File Path OR Base64 Data.

If you pass a valid file path (e.g., "C:\image.png"), the robot automatically loads and encodes it.

If you pass a long text string (Base64), it is used directly.

P3 - (Required) MIME Type. Defines the format of P2. Common values:

Images: `image/png`, `image/jpeg`, `image/webp`

Audio: `audio/mp3`, `audio/wav`, `audio/aac`

Documents: `application/pdf`

Video: `video/mp4`

P4 - (Optional) Variable to store the text result. If omitted, the result is placed on the Top of Stack (TOS).
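
Because P3 must match the content of P2, a script that handles many files may want to derive the MIME type from the file extension. The Python sketch below shows one way to do that, restricted to the values listed for P3 above. The table and the function name `guess_mime` are hypothetical and not part of the robot.

```python
import os

# Extension-to-MIME table limited to the P3 values listed on this page.
MIME_BY_EXTENSION = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".aac": "audio/aac",
    ".pdf": "application/pdf",
    ".mp4": "video/mp4",
}


def guess_mime(path: str) -> str:
    """Return the MIME type for a file path, based only on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in MIME_BY_EXTENSION:
        raise ValueError(f"No MIME type known for extension {ext!r}")
    return MIME_BY_EXTENSION[ext]


print(guess_mime(r"C:\MyDocuments\Invoice_Scan.jpg"))
print(guess_mime("meeting.MP3"))
```

An unknown extension raises an error instead of guessing, since a wrong MIME type is likely to make the request fail or be misread.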

 

Remarks

 

- Requires an internet connection.

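- The input size is limited by the Google API (typically up to 20 MB for a direct payload). Because the media is sent as Base64, which is about one third larger than the original file, the practical limit for the file itself is lower. Larger files may require Google Cloud Storage URIs, which this command does not support.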

- Media input usually consumes far more tokens than a short text prompt. An image typically costs about 258 tokens, though the exact count depends on the model and the image size.
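
As a rough aid for the size limit in the remarks above, the Python sketch below estimates how large a file becomes after Base64 encoding and compares it with a 20 MB limit. Reading "20MB" as 20 * 1024 * 1024 bytes is an assumption, and the check ignores the prompt text and the JSON wrapper, so treat the result as an upper bound on what fits. The names `base64_length` and `fits_inline` are hypothetical.

```python
LIMIT_BYTES = 20 * 1024 * 1024  # assumed interpretation of the 20 MB limit


def base64_length(raw_bytes: int) -> int:
    """Length of padded Base64 text for a given number of raw bytes."""
    # Every 3 raw bytes become 4 characters; a partial group is padded to 4.
    return 4 * ((raw_bytes + 2) // 3)


def fits_inline(raw_bytes: int, limit: int = LIMIT_BYTES) -> bool:
    """True if the Base64 form alone stays within the limit."""
    return base64_length(raw_bytes) <= limit


for megabytes in (10, 15, 16):
    size = megabytes * 1024 * 1024
    print(megabytes, base64_length(size), fits_inline(size))
```

Under these assumptions a file of roughly 15 MB is the largest that still fits once encoded.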

 

See also:

AIG.Ask

AIG.FileToBase64

AIG.GenerateImage