The black box of a generative model opened for the first time: Anthropic reveals its secrets

Traditionally, a generative model used in artificial intelligence applications is treated as a “black box”. Once the input data (the prompt) has been fed in, the model carries out some processing internally and then produces an output. Why the model generates one particular response rather than another, however, is not obvious. In another article we tried to explain how an LLM (Large Language Model) works without using mathematics.

Founded in 2021 by former OpenAI employees, Anthropic is the thriving company that brought the Claude artificial intelligence to Europe. The latest incarnation of the model, Claude 3, was described from the outset as capable of delivering superior performance compared to its main competitors, such as OpenAI’s ChatGPT and Google’s Gemini.

However, it is Anthropic itself that shines a light on an aspect that is often left in the background. If models are thought of only as black boxes, how is it possible to blindly trust their safety? If we don’t know how they work, how can we ensure they don’t produce harmful, biased, untrue or dangerous responses? The need for greater transparency is obvious.

The first study on the behavior of generative models comes from Anthropic

In the field of artificial intelligence, understanding the inner workings of LLMs represents a complex challenge. Anthropic engineers explain that they recently carried out a study aimed at identifying how millions of concepts are represented within Claude Sonnet, one of the company’s latest-generation language models.

The work carried out by Anthropic (which we invite you to consult) is presented as the first detailed analysis of a modern artificial intelligence model used in production. The goal is not only to raise awareness among end users, but also to open up new avenues for making models safer and more reliable.

Looking inside the generative model

Anthropic technicians explain that taking a look at the contents of the “black box” is not enough. The elements that the model processes before generating a response, as we observed in the article cited at the beginning, are in fact a long list of numbers, known as “neuron activations”, without a clear meaning. That models like Claude are able to understand and use a wide range of concepts is obvious. But we cannot discern these concepts by directly observing the neurons: each concept is represented by many neurons, and each neuron helps represent many concepts.

In the past, Anthropic has identified patterns of neuron activation, called “features”, and associated them with humanly interpretable concepts. Using a technique called “dictionary learning”, borrowed from classic machine learning, the company isolated neuron activation patterns that recur across many contexts. Any internal state of the model can thus be represented in terms of a few active features rather than many active neurons. Just as every word in a dictionary is made up of letters and every sentence is made up of words, every feature in an AI model is made up of neurons and every internal state is made up of features.
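Anthropic implements dictionary learning with sparse autoencoders trained on the model’s internal activations. The following is a minimal sketch of the idea in Python (PyTorch); the layer width, dictionary size and sparsity penalty are arbitrary illustrative values, not those used on Claude.

```python
# Minimal sketch of dictionary learning with a sparse autoencoder.
# Dimensions and the sparsity penalty are illustrative, not Anthropic's actual values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly-zero coefficients
        reconstruction = self.decoder(features)
        return features, reconstruction

d_model, n_features = 512, 4096               # hypothetical layer width and dictionary size
sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(256, d_model)       # stand-in for activations captured from a middle layer
features, reconstruction = sae(activations)

# Objective: reconstruct the activations faithfully while keeping few features active (L1 penalty).
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()
```

Trained this way, an internal state that was once a long list of opaque numbers can be summarised as “these few features are active”, which is what makes it humanly interpretable.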

Revolutionary results

The Anthropic study clearly demonstrates that not even the developers of AI models can know, in detail, all the secrets of the LLMs they create.

The company’s engineers claim to have successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet, producing a coarse conceptual map of its internal states halfway through the computational process.

The features found in Sonnet have a depth, breadth and abstraction that reflect the advanced capabilities of the generative model. Anthropic says it has identified features corresponding to a wide range of entities, such as cities (San Francisco), people (Rosalind Franklin), chemical elements (lithium), scientific fields (immunology) and programming syntax (function calls). These features are multimodal and multilingual, responding to images of a given entity as well as to its name or description in many languages.

It was also possible to measure a kind of “distance” between features based on the neurons involved in their activation patterns, allowing the search for features that are “close” to each other. For example, near the “Golden Gate Bridge” feature, Anthropic found features for Alcatraz, Ghirardelli Square (a well-known square in San Francisco), the Golden State Warriors (the San Francisco basketball team), California governor Gavin Newsom, the 1906 earthquake and Alfred Hitchcock’s film “Vertigo”, set in San Francisco. Not to mention the many more abstract features.
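In practice, one plausible way to run such a proximity search is to compare the direction each feature writes into the model’s activation space. The sketch below ranks features by cosine similarity under that assumption; the feature labels and vectors are invented purely for illustration.

```python
# Sketch: ranking features "near" a query feature by cosine similarity of their
# dictionary directions. Labels and vectors are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 6, 512
feature_directions = rng.normal(size=(n_features, d_model))   # one direction per learned feature
labels = ["Golden Gate Bridge", "Alcatraz", "Ghirardelli Square",
          "Golden State Warriors", "1906 earthquake", "Lithium"]

def nearest_features(query_index: int, k: int = 3):
    """Return the k features whose directions are closest to the query feature's."""
    q = feature_directions[query_index]
    norms = np.linalg.norm(feature_directions, axis=1) * np.linalg.norm(q)
    similarity = feature_directions @ q / norms
    order = np.argsort(-similarity)
    return [(labels[i], float(similarity[i])) for i in order if i != query_index][:k]

print(nearest_features(labels.index("Golden Gate Bridge")))
```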

The evidence gathered by Anthropic represents an important step forward in making AI models safer. The techniques described could be used to monitor AI systems so as to avoid dangerous behavior, to steer them towards desirable outcomes, or to remove “harmful” topics altogether.
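Anthropic demonstrated this kind of steering by artificially amplifying or suppressing individual features during generation. Continuing from the sparse autoencoder sketch above, the fragment below shows one plausible way to do it; the feature index and clamp value are hypothetical.

```python
# Sketch: steering the model by clamping one learned feature at inference time.
# Assumes the SparseAutoencoder from the earlier sketch; index and scale are hypothetical.
import torch

@torch.no_grad()
def steer(activations: torch.Tensor, sae, feature_index: int, scale: float = 5.0) -> torch.Tensor:
    """Re-encode activations, clamp one feature, and decode back."""
    features = torch.relu(sae.encoder(activations))
    features[..., feature_index] = scale      # scale > 0 amplifies the concept; 0 suppresses it
    return sae.decoder(features)

# During generation, a forward hook on the chosen middle layer would replace its output
# with steer(output, sae, feature_index), nudging responses toward or away from the
# concept that the feature represents.
```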
