Generative models represent one of the most important innovations of the last decade. In another article we saw how to run and use LLMs locally with Ollama, something that was unthinkable until a few months ago. In a surprise move, NVidia is now trying to “democratize” access to artificial intelligence solutions by releasing the first version of TensorRT-LLM for Windows.
What is TensorRT-LLM and how does it work
As already revealed in the past, TensorRT-LLM is an open source library that accelerates inference performance with the latest large language models (LLMs), such as Llama 2 and Code Llama. While TensorRT-LLM reached the world of data centers starting in September 2023, NVidia now offers the same capabilities to Windows users.
GeForce RTX and RTX GPUs, equipped with processors dedicated to artificial intelligence workloads (Tensor Cores), make it possible to enjoy the power of LLMs natively on over 100 million Windows-based PCs and workstations. According to NVidia, the Windows version of TensorRT-LLM is four times faster than any solution available today and reduces processing times by relying on the computational power of RTX cards.
NVidia has also released software tools to help developers accelerate their large language models, including scripts that optimize custom models with TensorRT-LLM, open source models optimized for TensorRT, and a reference design that demonstrates the achievable response speed and quality.
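For developers, the library also exposes a Python API. As a rough sketch only (the interface has changed across releases, and the model identifier below is just an example, not something named by NVidia), running inference with the high-level LLM API looks roughly like this:

```python
# Sketch: running an LLM with TensorRT-LLM's high-level Python API.
# The API has evolved across releases; this follows the LLM API found
# in recent versions of the library and may differ on older ones.
from tensorrt_llm import LLM, SamplingParams

# The model id below (a Llama 2 chat checkpoint) is only an example.
# On first use the library builds a TensorRT engine optimized for the
# local GPU, which is where the advertised speed-up comes from.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling = SamplingParams(temperature=0.8, max_tokens=128)

for output in llm.generate(["What is TensorRT-LLM?"], sampling):
    print(output.outputs[0].text)
```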
What can be done with TensorRT-LLM
The possible application fields of TensorRT-LLM are practically endless: LLMs are at the center of new AI-driven workflows and are also, of course, the protagonists of software that automatically analyzes data and generates a vast range of content.
The leap forward in performance guaranteed by TensorRT-LLM opens the door to increasingly sophisticated uses: the development of writing assistants and software development tools that autonomously produce multiple high-quality results. The solution proposed by NVidia also allows you to integrate LLM capabilities with other technologies, for example retrieval-augmented generation (RAG), an approach within natural language processing (NLP) that combines text generation with the ability to retrieve and use information from a reference dataset or knowledge base.
In practice, during the generation process, the model has access to a set of related documents or data and can retrieve specific fragments or pieces of information to integrate into the generated text. This makes it possible to produce more informed and relevant answers, especially when understanding context is crucial.
An example of a RAG application could be a virtual assistant which, when asked a question, not only generates an answer based on its internal “knowledge” (developed during training), but can also retrieve additional information from an external source to provide a more complete and accurate answer.
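To make the idea concrete, here is a minimal, self-contained sketch of the RAG pattern in Python. Everything in it is illustrative: the toy bag-of-words retrieval, the sample documents, and the retrieve and ask helpers are all assumptions for the sake of the example, and a real system would use an embedding model and a vector store instead.

```python
# Minimal RAG sketch: retrieve the most relevant snippet from a tiny
# knowledge base and prepend it to the prompt before generation.
# All names here are illustrative, not an official API.
from collections import Counter
import math

documents = [
    "TensorRT-LLM is an open source library that accelerates LLM inference.",
    "Tensor Cores are dedicated AI processors found on GeForce RTX GPUs.",
    "Stable Diffusion is a diffusion model that generates images from text.",
]

def _vector(text: str) -> Counter:
    # Toy bag-of-words representation; a real system would use embeddings.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = _vector(query)
    ranked = sorted(documents, key=lambda d: _cosine(q, _vector(d)), reverse=True)
    return ranked[:k]

def ask(question: str) -> str:
    # The retrieved fragment is injected into the prompt, so the model
    # can ground its answer in it instead of relying only on training data.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(ask("What do Tensor Cores do?"))
```

The key design point is that retrieval happens at query time: the prompt handed to the LLM already contains the relevant fragment, without any retraining of the model.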
Stable Diffusion accelerated with TensorRT
Diffusion models, such as Stable Diffusion, are used to create striking works of art with artificial intelligence. TensorRT accelerates these models through layer fusion, precision calibration, kernel auto-tuning and other features that greatly increase the efficiency and speed of inference. TensorRT thus aims to become the de facto standard for real-time applications and resource-intensive tasks.
NVidia claims that TensorRT is capable of doubling the speed of Stable Diffusion, and has made it compatible with the popular Automatic1111 WebUI distribution. This acceleration allows users to iterate faster, reducing waiting times and producing the final image more quickly.
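The WebUI extension itself requires no code, but for the sake of illustration, one way to route Stable Diffusion inference through TensorRT from Python is ONNX Runtime's TensorRT execution provider, here via the diffusers ONNX pipeline. This is a sketch under stated assumptions, not NVidia's extension: it requires a GPU build of onnxruntime with TensorRT support, and the checkpoint and revision names below are only examples.

```python
# Sketch: Stable Diffusion via ONNX Runtime's TensorRT execution
# provider. Requires diffusers plus an onnxruntime-gpu build compiled
# with TensorRT support; this is NOT NVidia's Automatic1111 extension.
from diffusers import OnnxStableDiffusionPipeline

pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",     # example checkpoint
    revision="onnx",                      # ONNX export of the weights
    provider="TensorrtExecutionProvider", # TensorRT handles inference
)

image = pipe("an astronaut riding a horse, oil painting").images[0]
image.save("astronaut.png")
```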
TensorRT-LLM for Windows will soon be available for download from the NVidia developer website. Open source models optimized for TensorRT and RAG are published on the company's GitHub repository.
The opening image is taken from NVidia’s presentation post.