GPU, how to create one yourself starting from 15 Verilog files

To learn how a CPU works, from architecture to control signals, there are countless resources available online. For GPUs, however, the picture is completely different.

There is certainly no shortage of documentation for learning how to program GPUs, but there is almost nothing on how graphics processing units operate at the hardware level. Why? Because the market is extremely competitive, and the low-level technical details of all modern architectures remain proprietary.

This is precisely the reason for the tiny-gpu project: to turn on a light in the fog and offer real help to anyone who wants to try their hand at designing and building a GPU.

How to create a GPU with the tiny-gpu project

Built using fewer than 15 fully documented Verilog files, tiny-gpu is a bare-bones GPU that nonetheless comes with an accurate description of its architecture and ISA (Instruction Set Architecture), working kernels that perform matrix addition and multiplication, and support for kernel execution. Kernels are portions of code executed on the GPU in parallel across different threads or cores.

It is also possible to trace kernel execution, observing which instructions are executed, in what order, and with what data. Execution tracing provides a detailed account of how the GPU executes code, allowing developers to analyze a program's behavior, identify errors or inefficiencies, and optimize performance.

tiny-gpu is a minimal GPU implementation optimized for learning. Manufacturers today focus on general-purpose GPUs (GPGPUs) on the one hand and on machine-learning accelerators such as Google's TPUs on the other. The authors of the tiny-gpu project highlight the general principles shared by these architectures rather than specific hardware details. The idea is to focus on the fundamental elements that are crucial to all modern hardware accelerators.

The tiny-gpu project takes readers by the hand, guiding them through topics such as GPU architecture, parallelization (how is the SIMD, Single Instruction, Multiple Data, model implemented in hardware?), and memory (the GPU must do its work within memory bandwidth constraints).

The more advanced final section presents some of the important optimizations applied in commercial GPUs with the specific aim of maximizing their performance.

Low-level GPU architecture

The image is taken from the tiny-gpu GitHub repository.

The architecture of tiny-gpu

Verilog is a hardware description language (HDL) used primarily for digital circuit design and simulation. It is widely used in the digital electronics industry to design and verify hardware such as integrated circuits, programmable logic devices (FPGAs), and ASICs (Application-Specific Integrated Circuits).

In the case of tiny-gpu, fewer than 15 Verilog files are enough to produce a fully functional GPU capable of running a single kernel at a time. The GPU itself is composed of a device control register, a dispatcher, a variable number of compute cores, memory controllers for data memory and program memory, and a cache.

The device control register stores metadata describing how kernels should run on the GPU. The dispatcher, on the other hand, is the unit that distributes threads across the compute cores.
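
As a rough illustration of what a dispatcher does, the Python sketch below splits a kernel's threads into blocks and hands the blocks to cores. The function names, the parameters (`threads_per_block`, `num_cores`) and the wave-by-wave assignment are assumptions made for this example; they are not taken from the tiny-gpu sources, where a core may receive its next block as soon as it finishes the current one.

```python
def make_blocks(thread_count, threads_per_block):
    """Split thread ids 0..thread_count-1 into consecutive blocks."""
    return [
        list(range(start, min(start + threads_per_block, thread_count)))
        for start in range(0, thread_count, threads_per_block)
    ]


def dispatch(thread_count, threads_per_block, num_cores):
    """Assign blocks to cores in waves (hypothetical policy).

    Each core holds at most one block at a time. Returns a list of waves,
    each a dict mapping core id to the thread ids of its block.
    """
    blocks = make_blocks(thread_count, threads_per_block)
    waves = []
    for first in range(0, len(blocks), num_cores):
        wave = dict(enumerate(blocks[first:first + num_cores]))
        waves.append(wave)
    return waves


if __name__ == "__main__":
    for i, wave in enumerate(dispatch(thread_count=10, threads_per_block=4, num_cores=2)):
        print(f"wave {i}: {wave}")
```

The last block may be only partly filled when the thread count is not a multiple of the block size, which is why `make_blocks` clamps the upper bound.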

The GPU is built to interface with an external global memory. In tiny-gpu, data memory and program memory are kept separate for simplicity.

Data memory uses an 8-bit address space (256 memory locations in total), and each location holds 8 bits of data. Program memory also uses an 8-bit address space, but its contents are stored in 16-bit words (each instruction is 16 bits, as specified by the ISA).
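
The two memories can be modeled in a few lines of Python. This is only a sketch of the sizes described above: the `Memory` class is hypothetical, and the choice to wrap out-of-range addresses and truncate oversized values is an assumption of the example, not documented tiny-gpu behavior.

```python
class Memory:
    """Word-addressed memory with an 8-bit address space (256 locations)."""

    ADDRESS_BITS = 8

    def __init__(self, data_bits):
        self.data_mask = (1 << data_bits) - 1
        self.addr_mask = (1 << self.ADDRESS_BITS) - 1
        self.cells = [0] * (1 << self.ADDRESS_BITS)

    def write(self, addr, value):
        # Assumption: addresses wrap and values are truncated to the word size.
        self.cells[addr & self.addr_mask] = value & self.data_mask

    def read(self, addr):
        return self.cells[addr & self.addr_mask]


data_memory = Memory(data_bits=8)       # 256 locations of 8 bits
program_memory = Memory(data_bits=16)   # 256 locations of 16 bits, one instruction each
```

With 256 locations of one byte each, the whole data memory holds just 256 bytes, which is why the demonstration kernels work on very small matrices.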

tiny-gpu also relies on a memory controller, which tracks all outgoing memory requests from the compute cores, throttles them according to the actual bandwidth of external memory, and relays responses from external memory back to the appropriate resources. Cache management, in turn, serves to free up bandwidth and speed up processing.

In this simplified GPU, each core processes one block at a time (a block is a group of threads running in parallel on a single core). For each thread in a block, the core has a dedicated ALU (arithmetic logic unit), LSU (load-store unit), PC (program counter), and set of registers.
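
The throttling idea can be sketched as a queue in front of a fixed number of memory channels. Everything in the example below (the `MemoryController` class, `num_channels`, a fixed `latency` of at least one cycle, read-only requests) is a simplifying assumption for illustration rather than a description of the Verilog module.

```python
from collections import deque


class MemoryController:
    """Toy model: at most `num_channels` reads are in flight at any time."""

    def __init__(self, memory, num_channels, latency):
        self.memory = memory
        self.num_channels = num_channels
        self.latency = latency      # cycles per access, assumed >= 1
        self.pending = deque()      # requests waiting for a free channel
        self.in_flight = []         # entries of [cycles_left, requester, addr]

    def request_read(self, requester, addr):
        self.pending.append((requester, addr))

    def tick(self):
        """Advance one cycle; return {requester: value} for completed reads."""
        # Throttling: admit waiting requests only while a channel is free.
        while self.pending and len(self.in_flight) < self.num_channels:
            requester, addr = self.pending.popleft()
            self.in_flight.append([self.latency, requester, addr])

        responses = {}
        still_busy = []
        for entry in self.in_flight:
            entry[0] -= 1
            if entry[0] == 0:
                # Route the response back to whoever issued the request.
                responses[entry[1]] = self.memory[entry[2]]
            else:
                still_busy.append(entry)
        self.in_flight = still_busy
        return responses


if __name__ == "__main__":
    controller = MemoryController(list(range(256)), num_channels=2, latency=2)
    for i in range(4):
        controller.request_read(f"lsu{i}", 10 + i)
    for cycle in range(4):
        print(cycle, controller.tick())
```

With two channels and four requesters, the last two requests only start once the first two have completed, which is the bandwidth limit the paragraph above describes.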

The role of the scheduler

Each tiny-gpu core uses a single scheduler responsible for managing thread execution. The scheduler runs the instructions of one block to completion before moving on to a new block, and it executes instructions for all threads synchronously and sequentially. More advanced schedulers use techniques such as pipelining to maximize resource utilization.

The main constraint that the scheduler must work with is the latency associated with loading and storing data from global memory. While most instructions can be executed synchronously, these load/store operations are asynchronous, meaning that the rest of the instruction execution must be built around these long wait times.
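
A back-of-the-envelope model makes the cost of this policy visible. In the sketch below, every instruction is assumed to take one cycle to issue, and a load or store stalls the whole block until the slowest thread's access completes. Both the cost model and the function names are assumptions for illustration; the real scheduler works through several internal stages per instruction.

```python
MEMORY_OPS = ("LDR", "STR")


def run_block(program, memory_latencies):
    """Run one block in lock-step and return the total cycle count.

    program: list of opcode names, e.g. ["CONST", "LDR", "ADD", "RET"].
    memory_latencies: latency in cycles of a memory access, one per thread.
    """
    cycles = 0
    for opcode in program:
        cycles += 1  # assumed fixed issue cost for every instruction
        if opcode in MEMORY_OPS:
            # Asynchronous access: all threads wait for the slowest one.
            cycles += max(memory_latencies)
        if opcode == "RET":
            break
    return cycles


def run_kernel(program, blocks):
    """Run blocks one after another: each completes before the next starts."""
    return sum(run_block(program, latencies) for latencies in blocks)


if __name__ == "__main__":
    program = ["CONST", "LDR", "ADD", "STR", "RET"]
    print(run_block(program, [3, 5, 4, 5]))
    print(run_kernel(program, [[3, 5], [2, 2]]))
```

Because the block advances only as fast as its slowest memory access, most of the cycles in this model are spent waiting, which is the inefficiency that pipelining and similar techniques aim to hide.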

The ISA on tiny-gpu

The tiny-gpu project implements a simple ISA of 11 instructions designed to support simple kernels capable of performing matrix addition and multiplication. The instructions on which this minimal GPU is based include:

  • BRnzp: Branch instruction that jumps to another line in program memory if the contents of the NZP register match the nzp condition in the instruction.
  • CMP: Compares the value of two registers and stores the result in the NZP register to be used for a subsequent BRnzp instruction.
  • ADD, SUB, MUL, DIV: Basic arithmetic operations to allow tensor processing. A tensor is a mathematical object that generalizes vectors and matrices to multiple dimensions. Tensors are often used to represent multidimensional data, such as images, videos, sensor data, or data matrices in machine learning and artificial intelligence applications.
  • LDR: Load data from global memory.
  • STR: Store data in global memory.
  • CONST: Load a constant into a register.
  • RET: Reports that the current thread has reached the end of execution.
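
To see how these instructions combine into a kernel, here is a small single-thread interpreter in Python. It is a sketch under several assumptions: instructions are written as tuples rather than 16-bit words, the operand order is chosen for this example, the NZP state is kept as one of the letters "n", "z", "p", values are 8 bits wide, division by zero yields 0, and register 15 is assumed to hold the thread id. None of these details is confirmed by the text above.

```python
def run_thread(program, memory, thread_id=0, max_steps=1000):
    """Execute `program` for one thread and return its final registers."""
    regs = [0] * 16
    regs[15] = thread_id  # assumption: R15 holds the thread id
    nzp = "z"
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        pc += 1
        if op == "CONST":
            rd, imm = args
            regs[rd] = imm & 0xFF
        elif op in ("ADD", "SUB", "MUL", "DIV"):
            rd, rs, rt = args
            a, b = regs[rs], regs[rt]
            result = {"ADD": a + b, "SUB": a - b, "MUL": a * b,
                      "DIV": a // b if b else 0}[op]
            regs[rd] = result & 0xFF
        elif op == "CMP":
            rs, rt = args
            diff = regs[rs] - regs[rt]
            nzp = "n" if diff < 0 else "z" if diff == 0 else "p"
        elif op == "BRnzp":
            condition, target = args  # e.g. ("n", 3): jump to 3 if negative
            if nzp in condition:
                pc = target
        elif op == "LDR":
            rd, rs = args
            regs[rd] = memory[regs[rs]]
        elif op == "STR":
            rs, rt = args
            memory[regs[rs]] = regs[rt]
        elif op == "RET":
            return regs
        else:
            raise ValueError(f"unknown instruction: {op}")
    raise RuntimeError("step limit reached without RET")


# Element-wise addition: A at addresses 0..3, B at 4..7, C at 8..11.
VECTOR_ADD = [
    ("CONST", 1, 4),    # R1 = base address of B
    ("CONST", 2, 8),    # R2 = base address of C
    ("LDR", 3, 15),     # R3 = A[i], since A starts at address 0
    ("ADD", 4, 1, 15),  # R4 = address of B[i]
    ("LDR", 4, 4),      # R4 = B[i]
    ("ADD", 5, 3, 4),   # R5 = A[i] + B[i]
    ("ADD", 6, 2, 15),  # R6 = address of C[i]
    ("STR", 6, 5),      # C[i] = R5
    ("RET",),
]

if __name__ == "__main__":
    memory = [1, 2, 3, 4, 10, 20, 30, 40] + [0] * 248
    for thread in range(4):
        run_thread(VECTOR_ADD, memory, thread_id=thread)
    print(memory[8:12])
```

Every thread runs the same program; only the thread id differs, so each thread reads and writes a different element. That is the SIMD idea in miniature.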

Each register is addressed with 4 bits, for a total of 16 registers. The first 13 registers, R0 to R12, are free registers that support both reads and writes. The last 3 are special read-only registers.
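
The register layout can be expressed as a small Python class. This is a sketch: the `RegisterFile` name, the 8-bit register width, the `special_values` parameter, and the decision to silently ignore writes to the read-only registers are assumptions of the example.

```python
class RegisterFile:
    """16 registers selected by a 4-bit index; R13..R15 are read-only."""

    NUM_REGISTERS = 16
    FIRST_READ_ONLY = 13

    def __init__(self, special_values=(0, 0, 0)):
        # R0..R12 start at zero; R13..R15 are fixed when the thread starts.
        self._regs = [0] * self.FIRST_READ_ONLY + list(special_values)

    def read(self, index):
        return self._regs[index & 0xF]

    def write(self, index, value):
        """Write an 8-bit value; return False if the register is read-only."""
        index &= 0xF
        if index >= self.FIRST_READ_ONLY:
            return False  # assumption: the write is simply dropped
        self._regs[index] = value & 0xFF
        return True


if __name__ == "__main__":
    registers = RegisterFile(special_values=(2, 4, 1))
    print(registers.write(0, 42), registers.read(0))
    print(registers.write(13, 9), registers.read(13))
```

Keeping the special registers read-only means a kernel cannot accidentally overwrite the values the hardware supplies to each thread.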

Overall, tiny-gpu is just a “taste” of how a GPU works. Many critical aspects of modern GPUs, such as support for artificial intelligence instructions, more advanced branch prediction, and a cache implementation, are still missing from the open implementation that has just been shared publicly.

However, tiny-gpu shows the fundamental principles that allow the parallel execution of kernels on a GPU, and it therefore proves to be a very valuable project for anyone who really wants to understand how GPUs work, starting from the lowest level. GPUs are complex, but an initiative like tiny-gpu provides a solid foundation from which to understand the logic and architecture that make these hardware devices so powerful.

Opening image credit: iStock.com – CasarsaGuru
