
LLM: Large Language Models on consumer GPUs with ExLlamaV2

A Large Language Model (LLM) is a type of highly parameterized machine learning model designed to understand and generate natural language text. These models are known for their power and scale: they are made up of millions or billions of parameters that allow them to capture and reproduce the complexity of natural language.

ExLlamaV2: the most powerful LLMs land on consumer GPUs

One of the most keenly felt problems in developing applications that integrate artificial-intelligence-based functionality is the difficulty of loading and using the largest LLMs. To manage them well and obtain answers in reasonable time, hardware configurations are typically required that not all users can afford.

The ExLlamaV2 project, recently published on GitHub, aims to bring the most powerful and versatile LLMs to modern consumer GPUs. It is a library that represents a significant advance in Natural Language Processing (NLP), offering an effective platform for inference with particularly large language models.

Parameters and inference

In the context of LLMs, the parameters refer to the “weights”, or coefficients, that make up the neural model. They are an integral part of the machine learning process and represent the interconnections between neural units within the model. The parameters define how the model interprets and generates data.

In the case of models based on the Transformer architecture, the parameters include the weights of the connections between the layers of the model. As we saw in the related article, attention is a fundamental concept in natural language processing and in machine learning in general. It refers to the model’s ability to give more weight to specific parts of an input during processing, based on their relevance or importance. Weights define how the model takes into account and combines information from different parts of the sequence being analyzed.
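
To make the idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention used in Transformer models. It is a didactic illustration only, not the optimized kernels ExLlamaV2 actually uses, and the dimensions are made up for the example:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how relevant its key is to the query,
    then return the weighted combination."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # combine the values by weight

# Toy example: 3 tokens, 4-dimensional embeddings
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)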

The term inference, instead, refers to the phase in which the trained model is used to make predictions or perform processing based on input data. In other words, during inference the model applies the “knowledge” acquired during the training phase to perform specific operations on the input data, without making further changes to its parameters.
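
As a toy illustration of this point (a single made-up layer, not an LLM), the sketch below applies frozen “trained” weights to new input; nothing is updated:

import numpy as np

# Hypothetical trained parameters, frozen at inference time
W = np.array([[0.5, -0.2], [0.1, 0.8]])
b = np.array([0.05, -0.1])

def infer(x):
    """Apply the trained parameters to new input; no weight updates occur."""
    return np.tanh(x @ W + b)

print(infer(np.array([1.0, 2.0])))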

Features of ExLlamaV2 and the “secret” of quantization

The new ExLlamaV2 library is designed to take full advantage of modern consumer GPUs, enabling complex language processing efficiently and quickly.

In the case of language models, weights are usually represented as floating point numbers (float): the process of quantization converts these numbers into a lower-precision format, typically represented with fewer bits (we discussed this in the article on binary code). This reduces the storage space and the computational load needed to process the weights during inference.
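
To see what this means in practice, here is a deliberately simplified symmetric int8 scheme in Python. EXL2’s actual format is far more sophisticated; treat this only as an illustration of the float-to-fewer-bits idea:

import numpy as np

def quantize_int8(weights):
    """Naive symmetric quantization: map float32 weights to int8
    plus a single float scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(8).astype(np.float32)
q, s = quantize_int8(w)
print(w)
print(dequantize(q, s))   # close to w, at 1/4 the storage (8 vs 32 bits)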

ExLlamaV2 introduces an advanced quantization technique called EXL2, which allows the use of 2, 3, 4, 5, 6 and 8-bit “levels”. In this way it becomes possible to assign greater precision to the most important parts of the model and lower precision to the less crucial parts. The selection of quantization parameters happens automatically, with an approach aimed at minimizing the error across the model.
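
The toy heuristic below illustrates the general idea of a mixed-precision budget (more bits where the model is most sensitive, while the average bits per weight stays within a target). It is not EXL2’s actual selection algorithm, and the sensitivity scores are invented for the example:

def assign_bits(sensitivities, budget_bpw, levels=(2, 3, 4, 5, 6, 8)):
    """Toy bit allocation: upgrade the most sensitive layers first,
    keeping the average bits per weight within the budget."""
    n = len(sensitivities)
    bits = [min(levels)] * n                  # start every layer at the lowest level
    order = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    for i in order:                           # most sensitive layers first
        for lvl in levels:                    # try progressively higher precision
            if lvl > bits[i] and (sum(bits) - bits[i] + lvl) / n <= budget_bpw:
                bits[i] = lvl
    return bits

# Four made-up layer sensitivities, targeting 4 bits per weight on average
print(assign_bits([0.9, 0.1, 0.5, 0.2], budget_bpw=4.0))   # -> [8, 2, 4, 2]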

It is precisely this skillful use of quantization that makes it possible to significantly reduce the model’s size and its resource consumption. Large models thus finally become usable on less powerful hardware.

To give an idea of what is possible with ExLlamaV2, consider that the Llama2 70B model (70 billion parameters) can be run on a single GPU with 24 GB of VRAM at 2.55 bits per weight, producing consistent and stable output.
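
A quick back-of-the-envelope check shows why this fits (weight storage only; the KV cache and activations consume additional VRAM on top of this):

# Weight storage for Llama2 70B at 2.55 bits per weight
params = 70e9
bits_per_weight = 2.55
gib = params * bits_per_weight / 8 / 2**30
print(f"{gib:.1f} GiB")   # about 20.8 GiB, which fits in a 24 GB card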

The library, accessible to anyone via the corresponding GitHub repository, thus has the great advantage of democratizing access to LLMs for designing and developing a wide range of real-world applications, including conversational chatbots, machine translation tools, text analysis and much more.

How to install and use ExLlamaV2

Installing and using ExLlamaV2 requires a few basic operations, such as cloning the repository and installing the dependencies. To proceed, just open a Linux terminal window and run the following command:

git clone https://github.com/turboderp/exllamav2

The dependencies are installed by entering the project folder and then requesting the installation of the packages listed in the requirements.txt file:

cd exllamav2
pip install -r requirements.txt

After installing ExLlamaV2, you can immediately launch an inference test and try some of the available features using the commands below. In particular, percorso_modello (the model path) must be replaced with the folder containing the LLM downloaded locally:

python test_inference.py -m <percorso_modello> -p "<testo_input>"

Obviously, testo_input must be replaced with the prompt to be passed as input to the chosen LLM.
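
For instance, assuming the model has been downloaded to a hypothetical folder such as /models/llama2-13b-exl2, the test could be launched like this:

python test_inference.py -m /models/llama2-13b-exl2 -p "Explain quantization in one sentence."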

To load a console chatbot, you can instead issue the following command, specifying as usual the model to use:

python examples/chat.py -m <percorso_modello> -mode llama

If you want to convert existing models using ExLlamaV2’s advanced quantization, you can proceed with the convert.py script. More information on this is available on the project’s GitHub page.
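
For reference, a conversion command following the pattern shown in the project’s README might look like the one below; the paths are hypothetical, and the flags (-i input model, -o working directory, -cf output folder, -b target average bits per weight) should be checked against the current documentation on GitHub:

python convert.py -i /models/llama2-7b-fp16/ -o /tmp/exl2/ -cf /models/llama2-7b-exl2/4.0bpw/ -b 4.0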

Opening image credit: iStock.com/da-kuk
