Hallucination: what it is and how Vectara measures this common AI problem

Hallucinations are a common problem with artificial intelligence and generative models. A hallucination is the phenomenon in which a model produces output that is unrealistic, incorrect or misleading.

What a hallucination is in the context of artificial intelligence

If you have used a generative model, or simply OpenAI's ChatGPT chatbot (or similar products), you have almost certainly come across implausible text produced by artificial intelligence. In these cases the output conveys information that seems reliable at first glance but has no basis in reality. Think, for example, of invented facts or of information about something that does not exist.

Image generation models can produce objects, people or scenes that do not exist in reality: think of distorted human faces or objects with impossible physical characteristics.

Another form of hallucination can occur when the model generates output that mirrors the training data too closely: in this case it produces copies of the original information, or combinations of pre-existing data, rather than original ideas.

Hallucinations can have several causes: the model was not trained on a sufficient amount of realistic and diverse data; the model is overly complex and tends to overfit the training data; or the underlying algorithm is itself prone to producing hallucinations.

Vectara's Hallucination Evaluation Model (HEM) evaluates the performance of any generative model

Vectara has developed and released an open source tool called “Hallucination Evaluation Model” (HEM), which evaluates how frequently generative LLMs (Large Language Models) hallucinate.

In the table compiled by Vectara's engineers, the values in the Answer Rate column represent the percentage of times the model actually attempted to generate a response or summary based on the data retrieved and available for the input question. The precision and reliability of the responses are instead measured by the Accuracy and Hallucination Rate metrics, also shown in the table.

Model | Answer Rate | Accuracy | Hallucination Rate | Average Summary Length (words)
GPT-4 | 100% | 97.0% | 3.0% | 81.1
GPT-3.5 | 99.6% | 96.5% | 3.5% | 84.1
Llama 2 70B | 99.9% | 94.9% | 5.1% | 84.9
Llama 2 7B | 99.6% | 94.4% | 5.6% | 119.9
Llama 2 13B | 99.8% | 94.1% | 5.9% | 82.1
Cohere-Chat | 98.0% | 92.5% | 7.5% | 74.4
Cohere | 99.8% | 91.5% | 8.5% | 59.8
Anthropic Claude 2 | 99.3% | 91.5% | 8.5% | 87.5
Mistral 7B | 98.7% | 90.6% | 9.4% | 96.1
Google Palm | 92.4% | 87.9% | 12.1% | 36.2
Google Palm-Chat | 88.8% | 72.8% | 27.2% | 221.1

The “Average Summary Length” column gives the average length, in words, of the summaries produced by each model.
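The relationship between the columns can be made explicit with a little arithmetic. The sketch below is a minimal illustration, not Vectara's code: it assumes every summary has already been judged as consistent with its source or not, and it computes accuracy over the answered documents only. The `Judgment` record, the `leaderboard_metrics` function and the sample data are hypothetical. In every row of the table, Accuracy and Hallucination Rate add up to 100%, and the sketch reproduces that relationship.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    """Outcome for one document and one model (hypothetical record)."""
    summary: Optional[str]      # None means the model declined to answer
    consistent: bool = False    # True if the source supports the summary


def leaderboard_metrics(judgments: list[Judgment]) -> dict[str, float]:
    """Compute the four table columns from per-document judgments."""
    answered = [j for j in judgments if j.summary is not None]
    if not answered:
        raise ValueError("no summaries to evaluate")
    accuracy = 100.0 * sum(j.consistent for j in answered) / len(answered)
    return {
        "answer_rate": 100.0 * len(answered) / len(judgments),
        "accuracy": accuracy,
        # Assumption: a summary is either consistent or hallucinated.
        "hallucination_rate": 100.0 - accuracy,
        "avg_summary_length": sum(len(j.summary.split()) for j in answered)
        / len(answered),
    }


# Made-up sample: five documents, one refusal, one hallucinated summary.
sample = [
    Judgment("The report covers third quarter sales.", consistent=True),
    Judgment("Sales tripled after the merger with Acme.", consistent=False),
    Judgment("The city council approved the new budget.", consistent=True),
    Judgment("Rainfall this spring was lower than the average.", consistent=True),
    Judgment(None),
]
metrics = leaderboard_metrics(sample)
print(metrics)
```

Here the length is a plain whitespace word count; a different tokenization would give slightly different averages.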

The real strength of generative models lies in the RAG approach

According to Vectara, however, the real power of modern language models lies, and will increasingly lie, in the so-called RAG (Retrieval Augmented Generation) approach: the ability of an AI system to draw on external sources of knowledge in order to supplement and improve the knowledge already represented inside the LLM. According to Vectara's engineers, using the RAG scheme also has the positive effect of reducing hallucinations.

The fundamental idea of RAG is to enrich the generative process with a preliminary step that retrieves relevant data. This approach aims to improve the quality and consistency of the generated output, because the generative model is fed specific, relevant information extracted by the initial retrieval step.

RAG uses the relevant information obtained through retrieval to guide and influence the generation process, providing context and support to the generative model.
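The retrieve-then-generate flow can be illustrated with a few lines of code. This is a deliberately simplified sketch, not how Vectara or any particular product implements RAG: retrieval here is a plain word-overlap ranking (real systems typically use a search index or embeddings), the function names `retrieve` and `build_prompt` are hypothetical, and the final call to a language model is left out.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )


documents = [
    "Vectara released the Hallucination Evaluation Model as an open source tool.",
    "Llamas are domesticated animals from South America.",
    "GPT-4 had a hallucination rate of 3.0% in the table published by Vectara.",
]
question = "Which hallucination rate did GPT-4 have?"
passages = retrieve(question, documents)   # retrieval step
prompt = build_prompt(question, passages)  # this text would go to the LLM
print(prompt)
```

The unrelated document about llamas is never shown to the model, so generation is grounded in the two passages most relevant to the question.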

To obtain the data in the table republished above, Vectara “fed” 1,000 short documents to the LLMs listed in the first column and asked them to summarize each document using only the content that appears in the document itself. Only 831 of the 1,000 documents were summarized by every model; the remaining documents were refused by at least one model. Vectara computed the accuracy and hallucination rates on those 831 documents.
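The procedure described above can be outlined as a simple loop. This is an illustrative sketch under stated assumptions, not Vectara's pipeline: `summarize` stands in for a call to the model under test, `score_consistency` stands in for a factual-consistency scorer such as HEM, the prompt wording is only modeled on the description above, and the 0.5 threshold is an assumed cut-off. The toy stand-ins at the bottom are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Optional

# Prompt wording modeled on the article's description (assumption).
PROMPT = (
    "Summarize the following document using only the content "
    "that appears in the document itself.\n\n{document}"
)


def evaluate(
    documents: list[str],
    summarize: Callable[[str], Optional[str]],
    score_consistency: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off between consistent and not
) -> dict[str, int]:
    """Count answered, consistent and hallucinated summaries."""
    counts = {"documents": len(documents), "answered": 0,
              "consistent": 0, "hallucinated": 0}
    for doc in documents:
        summary = summarize(PROMPT.format(document=doc))
        if summary is None:  # the model declined to answer
            continue
        counts["answered"] += 1
        if score_consistency(doc, summary) >= threshold:
            counts["consistent"] += 1
        else:
            counts["hallucinated"] += 1
    return counts


def toy_summarize(prompt: str) -> Optional[str]:
    """Hypothetical stand-in for an LLM call."""
    document = prompt.split("\n\n", 1)[1]
    if "confidential" in document:
        return None  # simulate a refusal
    first_sentence = document.split(".")[0] + "."
    if "budget" in document:
        return first_sentence + " The mayor resigned."  # unsupported claim
    return first_sentence


def toy_score(document: str, summary: str) -> float:
    """Hypothetical scorer: 1.0 only if every sentence appears verbatim."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return 1.0 if all(s in document for s in sentences) else 0.0


docs = [
    "The council approved the budget. Spending rises next year.",
    "The river flooded in March. Repairs took two weeks.",
    "This confidential memo lists salaries. It is not public.",
]
counts = evaluate(docs, toy_summarize, toy_score)
print(counts)
```

Passing the model call and the scorer in as functions keeps the loop independent of any specific LLM or evaluation model, so either can be swapped without touching the counting logic.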

The model developed and used by Vectara is publicly available on Hugging Face. Anyone can check how HEM works and run their own tests independently.

Opening image credit: iStock.com/da-kuk

