
gigaGPT: A model like GPT-3 in just 565 lines of code


Cerebras is a company, which we have often talked about, that focuses on the design and development of processors for artificial intelligence (AI) and deep learning. Founded in 2016 by Andrew Feldman, Cerebras has attracted great interest from industry giants by creating innovative hardware designed specifically for the computational needs of AI. The company has now announced the availability of gigaGPT, presented as the simplest and most compact solution for training and optimizing generative models.

What is gigaGPT and how does it work

The developers at Cerebras explain that they used, as a starting point, the nanoGPT model by Andrej Karpathy, founding member of OpenAI and one of the best-known figures in the field of artificial intelligence.

While nanoGPT, published on GitHub, can train models with up to 100 million parameters, gigaGPT is designed to train models with over 100 billion parameters. What is unique about gigaGPT is that it achieves these results without introducing additional code or relying on third-party frameworks: the whole repository consists of just 565 lines of code.
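
To give an idea of how compact a GPT-style implementation can be, here is a minimal decoder block written with plain PyTorch primitives, in the spirit of nanoGPT. It is a sketch for orientation only and does not reproduce the actual gigaGPT code; all names in it are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A single GPT-style decoder block: causal self-attention followed by an MLP."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: each token may only attend to itself and earlier tokens.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```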

Lines of Cerebras gigaGPT code

The architecture of a transformer (we have looked at in detail what it is) is simple; training a large transformer on a large number of GPUs is decidedly more complex.

Models based on the GPT (Generative Pre-trained Transformer) architecture, without significant modifications or customizations, run into memory limitations even on the highest-performing GPUs once they exceed a few billion parameters. The alternative is to fragment the models into smaller pieces, distribute them across multiple GPUs, coordinate the workload across nodes, and “assemble” the results. This process, known as “LLM scaling”, is complicated and requires frameworks such as Megatron, DeepSpeed, NeoX, Fairscale and Mosaic Foundry. The gigaGPT model developed by Cerebras engineers offers a simpler and lighter approach.
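
To see why this is painful, consider even the most naive form of model splitting in plain PyTorch: as soon as a model no longer fits on one GPU, the code has to decide where each layer lives and move activations between devices by hand. The sketch below is a toy illustration (it assumes two CUDA devices and has nothing to do with gigaGPT's code); real scaling frameworks add pipeline schedules, gradient synchronization and inter-node communication on top of this.

```python
import torch
import torch.nn as nn

class SplitMLP(nn.Module):
    """Toy model parallelism: half of the layers on cuda:0, the other half on cuda:1."""
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        half = n_layers // 2
        self.part0 = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(half)]).to("cuda:0")
        self.part1 = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers - half)]).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        # The activation must hop to the second device before the next stage can run.
        x = self.part1(x.to("cuda:1"))
        return x
```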

A model that takes full advantage of the Cerebras processor architecture

Cerebras’ main product is called Wafer Scale Engine (WSE): a processor that stands out for its extraordinarily large size. Instead of placing numerous smaller chips on a single board, the WSE is essentially a single chip covering the entire surface of a silicon wafer. This approach minimizes communication time between chips, improving performance for the highly parallelized workloads typical of the machine learning training phase.

Cerebras’ stated goal is to accelerate progress in the field of artificial intelligence by providing a hardware platform on which advanced computations can run more efficiently than on traditional architectures.

The structure of the gigaGPT model, in brief

gigaGPT does not use sharding or pipelining because the model fits entirely into the memory of the Cerebras system. Cerebras Wafer Scale clusters consist of 1 to 192 Cerebras CS-2 systems, supported by CPU server nodes that store parameters and data, and an interconnect (SwarmX).

Sharding refers to dividing a large data set or model into smaller portions (shards): each shard represents a part of the overall workload. Pipelining is a technique in which different stages of a process are performed simultaneously, with the output of one stage becoming the input for the next.
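
As a purely illustrative example (not gigaGPT code), sharding a sequence of samples across three workers could look like this:

```python
# Toy sharding: split a sequence of samples into interleaved shards,
# each of which could be assigned to a different worker or device.
def shard(dataset, num_shards):
    return [dataset[i::num_shards] for i in range(num_shards)]

tokens = list(range(10))
print(shard(tokens, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```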

The model just revealed by Cerebras’ engineers consists mainly of two Python files, model.py and train.py. It uses primitives from PyTorch, the popular open source framework for machine learning and deep learning developed by the Facebook AI Research lab (FAIR), without resorting to complex external libraries.
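
“PyTorch primitives” here means the ordinary building blocks of a training script. A conventional loop of that kind looks roughly like the sketch below; it is generic PyTorch for illustration, not the actual contents of train.py.

```python
import torch
import torch.nn.functional as F

def train(model, dataloader, num_steps, lr=3e-4):
    """Generic next-token training loop built only from PyTorch primitives (illustrative)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (inputs, targets) in zip(range(num_steps), dataloader):
        logits = model(inputs)  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```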

The training loop is managed by cerebras_pytorch, a custom PyTorch wrapper specialized to run on Cerebras systems. Data management is entrusted to two ad hoc components (cstorch.utils.data.DataLoader and cstorch.utils.data.DataExecutor), which coordinate data distribution and training operations on Cerebras systems.

gigaGPT is published as an open source project in the official GitHub repository.
