AI - Direct Access


AIC.Estimate Token Count



MiniRobotLanguage (MRL)

 

AIC.Estimate Token Count
Get the estimated number of Tokens in a Text or String.

 


 

Intention

 

The AIC.Estimate Token Count command delivers a rough approximation of how many Tokens a Text or String contains.

This is useful to check whether a Text or String will fit into a model's Token maximum (its context window).

 

OpenAI's GPT models, like GPT-3, generally use a variant of Byte Pair Encoding (BPE) tokenization.

The number of BPE (Byte Pair Encoding) pairs in the original vocabulary of models like GPT-2 and GPT-3 is on the order of tens of thousands.

For instance, GPT-2 uses a vocabulary size of 50,257 tokens. GPT-3, being a larger and more advanced model, has a similar vocabulary size.

Therefore the AIC.Estimate Token Count command only performs a statistically driven estimation of the Token count, based on the appearance of spaces and characters in the Text.

 

Syntax:

 

AIC.Estimate Token Count|<Text>|<Variable for Result>

 

Parameters:

<Text>: The Text or String (or a variable containing it) whose Token count should be estimated.

<Variable for Result>: The variable that receives the estimated Token count. If it is omitted, the result is placed on the Top of Stack (TOS).

 

Example Usage:

 

$$TXT=This is a Text to be tokenized and counted

AIC.Estimate Token Count|$$TXT|$$NUM

MBX.The ETC of the Text is: $$NUM

 

This example will output the estimated Token count of the given Text. The estimation generally assumes between 2.0 and 4.5 characters per Token.
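
As a rough illustration of that range (not an exact tokenizer result): the example Text above has 42 characters, so the estimate will fall somewhere between 42 / 4.5 ≈ 9 and 42 / 2.0 = 21 Tokens.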

 

 

Syntax

 

 

AIC.Estimate Token Count|P1[|P2]

AIC.etc|P1[|P2]

 

 

Parameter Explanation

 
P1 - Variable or Text for which the estimated Token count is calculated.

P2 - opt. Variable for the result. If omitted, TOS is used.
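
The abbreviated form AIC.etc works exactly like the long form. A minimal sketch (the sample Text and the variable names are just placeholders):

$$TXT=How many Tokens does this sentence have?

AIC.etc|$$TXT|$$NUM

MBX.Estimated Token count: $$NUM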

 

 

Example

 

'*****************************************************

' EXAMPLE 1: AIC.-Commands

'*****************************************************

$$TXT=This is a Text to be tokenized and counted

AIC.Estimate Token Count|$$TXT|$$NUM

MBX.The ETC of the Text is: $$NUM

 

 

Remarks

As noted above, OpenAI's GPT models use a variant of Byte Pair Encoding (BPE) tokenization, with a vocabulary on the order of tens of thousands of entries. GPT-2, for instance, uses a vocabulary size of 50,257 tokens, and GPT-3 has a similar vocabulary size.

 

It's important to note that these tokens are not just BPE pairs, but also include individual characters, common words, and subwords.
The BPE algorithm is used to construct this vocabulary by iteratively merging frequent pairs of characters or subwords.

 

The vocabulary is an essential part of the model, and it is constructed during the pre-training phase.
When tokenizing text for input to the model, the tokenizer uses this pre-established vocabulary to convert the text into a sequence of token IDs that the model can process.

 

To exactly replicate the tokenization used by GPT models, one would need to use the same vocabulary and tokenization algorithm as used in the pre-training of these models.

Since this would require a large Token dataset with more than 50,000 Token pairs, the SPR only performs a statistically driven estimation of the Token count.
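
Because the result is only an approximation, it is advisable to keep a safety margin when comparing the estimate against a model's Token maximum.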

 

 

Limitations:

-

 

 

See also:

 

  Set_Key

  Ask_Chat

  Ask_Completion