
How to explain how an LLM language model works without using mathematics (or almost)


There is a lot of talk, at every level, about generative artificial intelligence, and the term language model is now on everyone’s lips. Large Language Models (LLMs) are computationally powerful language models designed to understand and generate text much as a real person would. They are capable of performing a wide range of linguistic tasks, such as machine translation, text generation, question answering, summarization and much more. They can “understand” context and produce coherent, meaningful text in response to questions or natural language input.

In our articles, we generally put the verb “understand” in quotation marks because LLMs obviously cannot rely on the same mechanisms underlying the human brain. They try to approximate its functioning, sometimes quite brilliantly, in other cases less effectively. But we are always talking about an approximation.

What does an LLM (Large Language Model) do?

As we have highlighted in other in-depth articles, how LLMs work is not well understood by many users. Behind their mode of operation lies a great deal of mathematics and statistics: for this reason the topic is considered difficult.

At the heart of LLMs is the concept of the token: although often equated with single words, tokens can actually represent character sequences, multiple words, or portions of words. A token can also represent a period or a space: the objective of every LLM is, in fact, to encode text in the most efficient way possible.

Each token in the vocabulary of a specific LLM is assigned a unique numerical identifier. The LLM uses a tokenizer to transform the text into an equivalent numerical sequence. These numerical representations are precisely the tokens.

A practical example with Python

The Python code we propose below uses the tiktoken library to encode and decode text with the GPT-2 model. This is the simplest model that OpenAI has released as open source, so it can be studied in detail. In another article we saw how to use spreadsheets and GPT-2 to demonstrate how LLMs work.

After importing the module, the code encodes the given sentence using the GPT-2 model, obtained with the function encoding_for_model(). Note the numbers returned for each token: by carrying out the reverse decoding operation, you can recover the initial sentence.

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-2")

>>> encoding.encode("Il gatto nero attraversa la strada buia.")
[33666, 308, 45807, 299, 3529, 708, 430, 690, 64, 8591, 965, 4763, 809, 544, 13]

>>> encoding.decode([33666, 308, 45807, 299, 3529, 708, 430, 690, 64, 8591, 965, 4763, 809, 544, 13])
'Il gatto nero attraversa la strada buia.'

LLMs and natural language token generation

Interestingly, as we mentioned previously, a token does not necessarily contain a whole word. For example, `[308]` is decoded to ‘ g’ (the beginning of the word “gatto”, with a leading space), `[45807]` to ‘atto’ and `[13]` to a period. Look at the output obtained in the figure by calling the function encoding.decode:

(Figure: decoding LLM tokens with Python code)
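You can reproduce the same check in an interactive session by decoding the token IDs one at a time; the values below are those returned by encoding.encode above.

>>> encoding.decode([308])
' g'
>>> encoding.decode([45807])
'atto'
>>> encoding.decode([13])
'.'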

Keep in mind that the tiktoken library, to be invoked from Python, must first be installed with the command pip install tiktoken.

Language models make predictions

Let’s go back to the statement we started with: when referring to the “comprehension” skills of LLMs, we put the term in quotation marks because the approach used by these systems is essentially probabilistic. Or better yet, as we explained in the article on generative models at the service of business decisions, the scheme they exploit is a stochastic one.

Language models make predictions about which token will follow the one being processed. Imagine a function that returns, for every single term contained in the model’s vocabulary, the probability with which it can follow a specific token.

Since the GPT-2 vocabulary consists of 50,257 tokens, such a function (developed, for example, in Python) would provide a list of 50,257 numbers indicating the probability with which each corresponding token could follow the word under consideration.
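A minimal sketch of the idea, assuming a hypothetical predict_next_token_probabilities() function (the body below is just a placeholder, not what GPT-2 actually computes):

VOCAB_SIZE = 50257  # size of the GPT-2 vocabulary

def predict_next_token_probabilities(token_ids):
    # Hypothetical placeholder: a real LLM would run the token sequence
    # through its neural network to score every vocabulary entry.
    # Here we simply return a uniform distribution over the vocabulary.
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

probs = predict_next_token_probabilities([33666, 308, 45807])
print(len(probs))  # 50257 probabilities, one per vocabulary token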

In the case of the example “The black cat crosses the dark street”, it can be hypothesized that – for example – a word like “tuna” has a near-zero probability of following in the sentence. Much more likely is a conjunction like “and”: “The black cat crosses the dark road and hides in a hedge”. Or, again, “and gets run over”, “and jumps a ditch” or even “and gets into a taxi” (certainly less likely…).

Reasonable predictions come from a careful training phase of the model. Volumes of text that are often impressive in size are fed to the model. In this way, it can capture the semantic links between the words of the language, always and only in probabilistic terms.

At the end of training, the model can calculate the probability with which each token can occur in a given sentence, depending on the sequence of tokens already present, based on the data structures built from all the text processed during the training phase.

Generation of long text sequences

In many articles we have said that creating a well-crafted prompt is the key to obtaining reasoned, contextualized and relevant answers from LLMs.

Imagine a hypothetical Python function that takes as input a user-supplied sentence (the prompt). What the model does, first of all, is tokenize the prompt, generating a sequence of numeric token identifiers (we saw this before with encoding.encode).

At this point, depending on the length of the text you want to obtain as output, it makes a series of predictions about the most likely tokens to complete the input token sequence. Have you noticed how LLMs reuse a lot of the input provided by the user when composing the response?

The Python function we’re talking about could simply select the token with the highest probability of following the previous one. This approach is called, in English, greedy selection. By using a pseudorandom number generator, however, it is possible to make the mechanism more “dynamic”: instead of always choosing the most probable token, one can randomly select one among those that – in terms of probability – exceed a certain threshold.
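Here is a hedged sketch of the two strategies; the 0.05 threshold and the weighted random choice are illustrative assumptions, not what GPT-2 does internally.

import random

def greedy_selection(probs):
    # Always pick the index of the most probable token.
    return max(range(len(probs)), key=lambda i: probs[i])

def sampled_selection(probs, threshold=0.05):
    # Keep only the tokens whose probability exceeds the threshold,
    # then pick one of them at random, weighted by probability.
    candidates = [i for i, p in enumerate(probs) if p > threshold]
    if not candidates:                      # fallback if nothing passes the threshold
        return greedy_selection(probs)
    weights = [probs[i] for i in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]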

Now you understand why, when given the same prompt at several different times, a generative model usually provides answers that can be quite different from each other.

Temperature in LLMs

In the context of LLMs, temperature is a parameter used during text generation that controls how “creative” or “conservative” the output text should be.

More specifically, temperature shapes the probability distribution over the words generated by the model. At a higher temperature, the model is more likely to produce creative and surprising results, as it spreads probability more evenly across the different words in its vocabulary. In contrast, at a lower temperature, the model generates more conservative and consistent results, as it leans almost exclusively on the most probable words (the concept behind greedy selection).
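A minimal sketch of how temperature is typically applied to the model’s raw scores (logits) before they become probabilities; the logit values here are made up for illustration.

import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide each raw score by the temperature, then normalize.
    # A low temperature sharpens the distribution (more conservative),
    # a high temperature flattens it (more "creative").
    scaled = [value / temperature for value in logits]
    exps = [math.exp(value) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.1]                       # hypothetical scores for three tokens
print(softmax_with_temperature(logits, 0.5))   # peaked: strongly favours the first token
print(softmax_with_temperature(logits, 2.0))   # flatter: probabilities closer together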

In the advanced settings of some LLMs you may have found references to the so-called hyperparameters top_p and top_k: they control how many of the most likely tokens are considered for selection.
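Again as an illustrative sketch (the function names are ours, not an official API): top_k keeps a fixed number of candidates, while top_p keeps the smallest set of tokens whose cumulative probability reaches p.

def top_k_filter(probs, k=3):
    # Keep the indices of the k most probable tokens.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def top_p_filter(probs, p=0.9):
    # Keep the most probable tokens until their cumulative probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept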

Once a token has been chosen with the approach described above, the cycle repeats. The function described previously is called again, with an input enriched with the new token. This generates a further token that follows the one just “queued”, and so on: the process continues.

LLMs do not have the concept of sentences and paragraphs, because – as we have seen – they work on one token at a time. To prevent the generated text from being truncated mid-sentence, the code can be set to break the cycle when the token corresponding to a period is generated.
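Putting the pieces together, a highly simplified generation loop might look like the sketch below. It reuses the hypothetical predict_next_token_probabilities() and sampled_selection() functions sketched above, and stops at token 13, which corresponds to the period in the GPT-2 vocabulary, as seen earlier.

PERIOD_TOKEN = 13  # "." in the GPT-2 vocabulary, as seen earlier

def generate(prompt, max_tokens=50):
    tokens = encoding.encode(prompt)                     # tokenize the prompt
    for _ in range(max_tokens):
        probs = predict_next_token_probabilities(tokens)
        next_token = sampled_selection(probs)            # or greedy_selection(probs)
        tokens.append(next_token)                        # "queue" the new token
        if next_token == PERIOD_TOKEN:                   # stop at the end of the sentence
            break
    return encoding.decode(tokens)                       # back to natural language text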

How LLM training works

Simplifying as much as possible, let’s suppose we use an LLM with a vocabulary made up of just 8 words: “the”, “dog”, “cat”, “eats”, “drinks”, “milk”, “meat”, “bread”.

Suppose that each word corresponds to a single token. Let’s also assume we have a set of only 3 training sentences:

“The dog eats meat”
“The cat drinks milk”
“The dog eats bread”

The LLM learns to predict the probability of each token based on the context of the training sentences. It then builds a table, in this…
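As a purely illustrative sketch of the idea (our own assumption of what such a table might contain), one could count, for each word, which words follow it in the three training sentences and turn those counts into probabilities:

from collections import Counter, defaultdict

sentences = [
    "the dog eats meat",
    "the cat drinks milk",
    "the dog eats bread",
]

# Count, for each word, which words follow it in the training sentences.
follow_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1

# Turn the counts into probabilities.
probability_table = {
    word: {nxt: count / sum(counter.values()) for nxt, count in counter.items()}
    for word, counter in follow_counts.items()
}

print(probability_table["dog"])   # {'eats': 1.0}
print(probability_table["eats"])  # {'meat': 0.5, 'bread': 0.5}
print(probability_table["the"])   # {'dog': 0.666..., 'cat': 0.333...}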
