These days mark the presentation of PowerInfer, a new project that puts a high-performance inference engine in the hands of developers and regular users, designed to run LLMs (Large Language Models) on PCs equipped with consumer-grade GPUs.
The new tool splits the neural network between GPU and CPU: "hot" neurons are those that fire frequently across many different inputs, while "cold" neurons activate only in response to specific inputs. The hybrid model proposed by PowerInfer preloads the former on the GPU, while the latter are computed on demand by the CPU.
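The hot/cold split described above can be sketched in a few lines. The snippet below is an illustration only, not PowerInfer's actual code: it assumes per-neuron activation counts have been gathered during an offline profiling phase, then partitions the neuron indices so the most frequently fired ones would be preloaded on the GPU.

```python
import numpy as np

def partition_neurons(activation_counts, hot_fraction=0.2):
    """Split neuron indices into 'hot' (frequently activated) and 'cold'
    sets based on activation counts observed during profiling."""
    order = np.argsort(activation_counts)[::-1]  # most active first
    n_hot = int(len(order) * hot_fraction)
    hot = order[:n_hot]    # candidates to preload on the GPU
    cold = order[n_hot:]   # kept on the CPU, computed on demand
    return hot, cold

# Example: 10 neurons with simulated activation counts
counts = np.array([95, 3, 88, 7, 91, 2, 85, 5, 90, 1])
hot, cold = partition_neurons(counts, hot_fraction=0.5)
```

With half the neurons marked hot, the five most active indices end up on the GPU side of the split, and the rarely fired ones stay with the CPU.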
How PowerInfer works and how it brings LLMs to consumer GPUs
This scheme significantly reduces GPU memory requirements and CPU-GPU data transfers. So much so that, compared with a server-grade GPU such as the NVIDIA A100, a configuration based on PowerInfer and a single NVIDIA RTX 4090 scores only 18% lower, generating tokens at an average rate of 13.20 per second, with peaks of almost 30 tokens per second.
PowerInfer supports models such as Falcon-40B and the Llama 2 family, and has been tested on multiple platforms, including x86-64 CPUs with NVIDIA GPUs on Linux and Apple M-series chips on macOS. To get started, you can follow the installation instructions published on GitHub, obtain the model weights, and run inference tasks. The project is released under the MIT license, and its potential can be tested through the online demo based on Falcon(ReLU)-40B-FP16.
The idea behind PowerInfer is inspired, as is often the case with artificial intelligence software, by the functioning of the human brain. It exploits the tendency of some neurons (or synapses) in any language model to be activated more frequently than others. In simple terms, as mentioned above, there are "hot" neurons that fire often and "cold" neurons whose activation varies more with the specific input. This balanced approach contributes to faster data processing.
Adaptive predictors and quantization
To further optimize performance, PowerInfer leverages so-called adaptive predictors, components that try to anticipate which neurons will be activated next. Sparse operators, in turn, are techniques that process only the "important" (activated) data, reducing computational complexity.
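The payoff of a sparse operator is that only the rows a predictor flags as active are ever multiplied. The sketch below is a simplified stand-in, not PowerInfer's implementation: a real adaptive predictor is a small learned network per layer, so here a fixed index set plays its role for illustration.

```python
import numpy as np

def sparse_ffn(x, W, predicted_active):
    """Compute only the rows of W that the predictor marked as likely
    active, instead of the full dense matrix product."""
    out = np.zeros(W.shape[0])
    out[predicted_active] = W[predicted_active] @ x  # partial matmul
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W = rng.standard_normal((64, 16))

# Stand-in for a predictor's output: 4 of 64 neurons flagged active.
active = np.array([1, 5, 12, 40])
y = sparse_ffn(x, W, active)
```

Only 4 of the 64 rows are computed; the rest of the output stays zero, which is where the savings in memory traffic and arithmetic come from.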
Finally, with PowerInfer the user can also leverage quantization, a technique that lowers data precision to reduce memory requirements and increase system efficiency. INT4, for example, is a quantization scheme that represents each value with 4 bits. Using this approach, model weights can be stored and handled more efficiently.
Model weights and downloads
Model weights are the LLM parameters that PowerInfer uses during inference. They include the weights of the connections between neurons, the biases, and other parameters that fully define the structure and behavior of the model. Overall, they represent the information learned during the model's training phase and mirror the semantic relationships between the various elements.
To obtain the PowerInfer model weights, you can download them from the Hugging Face repository. The available weights come in files with the .powerinfer.gguf extension: the model whose filename does not carry the q4 prefix is the non-quantized one.
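Following the naming convention just described, a small helper can tell the two variants apart before loading. The filenames below are made up for illustration; only the q4-prefix rule comes from the text above.

```python
from pathlib import Path

def is_quantized(filename):
    """Heuristic based on the naming convention: files whose names start
    with 'q4' hold INT4-quantized weights; the others are full precision.
    (Example filenames are hypothetical.)"""
    return Path(filename).name.lower().startswith("q4")

print(is_quantized("q4.powerinfer.gguf"))     # quantized variant
print(is_quantized("model.powerinfer.gguf"))  # non-quantized variant
```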
Opening image credit: iStock.com – Just_Super