StreamingLLM turbocharges language models: how it works

Large Language Models (LLMs) are extremely advanced and articulate language models. Trained on huge quantities of linguistic data, they learn the structure and characteristics of human language. The use of LLMs has led to significant advances in many areas of natural language processing; modern generative models show exceptional performance in understanding context and generating text. However, there are some important gray areas: in current LLMs, information handled early in a session is lost as the input grows longer. They are therefore not well suited to demanding tasks such as summarizing books or documents containing large amounts of information.

StreamingLLM is the answer to these needs: it is a new approach that allows the processing of infinitely long inputs while keeping costs affordable and performance in line with what is expected of, for example, an advanced chatbot.

The idea behind StreamingLLM

The project published on GitHub proposes a new approach for modern LLMs: called, precisely, StreamingLLM, it aims to handle inputs of potentially infinite length without sacrificing efficiency or performance. The scheme described by the researchers addresses the challenges posed by long inputs and by multi-round dialogues, which require prolonged interactions.

One of the main problems, which usually brings down well-known LLMs, has to do with memory management, not to mention the difficulty the most popular models show in processing (generalizing to) inputs longer than the texts on which they were trained. Generalization is the ability of a model to apply the "knowledge" acquired during training to new situations or data not previously seen.

The so-called training sequence length indicates the length of the texts used during the model's training phase. This length is a critical parameter: it determines the size of the context the model can consider when generating texts or trying to understand them.

With the introduction of StreamingLLM, you can rely on an efficient framework that allows LLMs to generalize to sequences of infinite length without requiring additional training steps.

How StreamingLLM works

In a language model, a token is the basic unit that represents a word, substring, or symbol within a sequence of text. Attention-based models are the class of machine learning models that use a specific mechanism to assign different weights to the various parts of the user-provided input during processing.

A group of Google researchers presented the attention mechanism and the idea of Transformers in 2017, in a historic and enlightening paper entitled Attention Is All You Need. The mechanism is inspired by the human capacity to focus on certain elements within a set of information. In the context of language models, attention allows the model to give more weight to certain tokens or parts of a sequence. The use of attention contributes greatly to a model's ability to capture long-range relationships within a sequence of data.
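The weighting described above can be sketched with the scaled dot-product attention formula from that paper. This is a minimal illustrative version in NumPy, not the code of any particular model; the random matrices stand in for learned query, key, and value projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: each row becomes a probability distribution.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # per-token attention weights, rows sum to 1
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one attended output per token
```

Each row of `w` says how strongly one token attends to every other token; that is the "different weights" mechanism in action.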

Current language models compute, for each token, two sets of parameters known as keys (Key) and values (Value). These parameters are used to weight the relative importance of tokens during text generation or context understanding. Information about previous tokens is stored in memory (caching) to speed up the processing of new tokens.
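The caching step can be sketched as follows. This is a simplified, hypothetical `KVCache` class (not the API of any real library): each decoded token appends its key and value vectors to the cache, so a new token's query only needs one attention pass over the stored entries:

```python
import numpy as np

class KVCache:
    """Minimal sketch of key/value caching during autoregressive decoding."""

    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def append(self, k, v):
        # Store this token's key/value so later tokens can attend to it.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Attention of a single query over all cached tokens.
        K = np.stack(self.keys)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

cache = KVCache()
rng = np.random.default_rng(1)
# Simulate decoding 5 tokens: cache each one, then attend from a new query.
for _ in range(5):
    k, v, q = rng.normal(size=(3, 8))
    cache.append(k, v)
    out = cache.attend(q)
print(len(cache.keys))  # 5: one cached entry per processed token
```

The cache is exactly what grows without bound on very long inputs, which is the memory problem StreamingLLM targets.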

What attention window and attention sink are

The attention window refers to a "window" of tokens within a text sequence that the model pays particular attention to during processing. It represents the number of preceding tokens the attention-based language model focuses on when generating a new token. The window size may vary depending on the model architecture and the specific parameters used during training.
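A plain sliding attention window over the cache can be sketched in a few lines; the function name is illustrative, not from the StreamingLLM codebase. Note how the earliest tokens, including the very first ones, are simply evicted:

```python
def attention_window(cached_tokens, window_size):
    """Naive sliding-window policy: keep only the most recent tokens."""
    return cached_tokens[-window_size:]

history = list(range(10))            # token ids 0..9
print(attention_window(history, 4))  # [6, 7, 8, 9]: tokens 0..5 are discarded
```

Evicting those first tokens is precisely what the attention-sink idea below avoids.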

With StreamingLLM, the researchers introduce the concept of the attention sink, which consists of keeping the keys and values of the initial tokens to maximize the performance of the attention window. Thus, even LLMs trained with a finite-length attention window can generalize to infinite-length sequences without requiring further adaptation (fine-tuning).
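The resulting cache-eviction policy can be sketched as follows, assuming (as a simplification) that we track token ids rather than actual key/value tensors; the function name and parameters are illustrative:

```python
def streaming_cache(cached_tokens, n_sink, window_size):
    """Sketch of StreamingLLM's eviction policy: always keep the first
    n_sink tokens (the attention sinks) plus a rolling window of the most
    recent window_size tokens, evicting everything in between."""
    if len(cached_tokens) <= n_sink + window_size:
        return list(cached_tokens)
    return list(cached_tokens[:n_sink]) + list(cached_tokens[-window_size:])

history = list(range(12))
print(streaming_cache(history, n_sink=4, window_size=4))
# [0, 1, 2, 3, 8, 9, 10, 11]: sinks survive, the middle is evicted
```

Because the cache size stays bounded at `n_sink + window_size` no matter how long the stream gets, memory use remains constant while the sink tokens keep the attention distribution stable.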

The advantages of using StreamingLLM

The research team behind StreamingLLM demonstrated that models such as Llama-2, MPT, Falcon and Pythia can perform language modeling tasks stably and efficiently, even beyond 4 million tokens. With some precautions in the pre-training phase, it is possible to further improve performance on long, complex inputs.

Compared to the solutions used to date, StreamingLLM can therefore deliver a performance improvement of more than 22 times. This video compares the behavior of an LLM without and with StreamingLLM while processing virtually infinite inputs. In the first case, after a certain number of interactions the model's performance drops sharply and memory problems appear. In the second case, the stream continues without stopping, without performance problems and without any interruption.
