When it comes to artificial intelligence and generative models, the problem of hallucinations is quite common. A hallucination is the phenomenon that leads a model to produce unrealistic, incorrect or misleading output.
What is a hallucination in the context of artificial intelligence
If you have used a generative model, or simply OpenAI's ChatGPT chatbot (or other similar products), you will certainly have come across implausible texts produced by artificial intelligence. In this case, the output conveys information that may seem reliable at first glance but has no real confirmation: think, for example, of invented facts or of information referring to something that does not exist.
Image generation models can produce objects, people or scenarios that do not exist in reality: think of distorted human faces or objects with impossible physical characteristics.
Another form of hallucination can occur when the model produces output that reflects the training data too closely: in this case we see copies of the source information, or combinations of pre-existing data, rather than original content.
Hallucinations can arise from several factors: the model was not trained on a sufficient amount of realistic and diversified data; it is overly complex, with a tendency to overfit the training data; or it is based on an algorithm that is itself prone to producing hallucinations.
Vectara's Hallucination Evaluation Model (HEM) evaluates the performance of any generative model
Vectara developed and presented an open source tool called "Hallucination Evaluation Model" (HEM), which evaluates how frequently generative LLMs (Large Language Models) exhibit the problem of hallucinations.
In the table drawn up by Vectara's engineers, the values in the Answer Rate column represent the percentage of times the model actually attempted to generate a response or a summary based on the data retrieved and available for the input question. The precision and reliability of the responses are instead measured with metrics such as Accuracy and Hallucination Rate, also shown in the table (the sketch after the table shows how these figures relate to each other).
| Model | Answer Rate | Accuracy | Hallucination Rate | Average Summary Length |
| --- | --- | --- | --- | --- |
| GPT-4 | 100% | 97.0% | 3.0% | 81.1 words |
| GPT-3.5 | 99.6% | 96.5% | 3.5% | 84.1 words |
| Llama 2 70B | 99.9% | 94.9% | 5.1% | 84.9 words |
| Llama 2 7B | 99.6% | 94.4% | 5.6% | 119.9 words |
| Llama 2 13B | 99.8% | 94.1% | 5.9% | 82.1 words |
| Cohere-Chat | 98.0% | 92.5% | 7.5% | 74.4 words |
| Cohere | 99.8% | 91.5% | 8.5% | 59.8 words |
| Anthropic Claude 2 | 99.3% | 91.5% | 8.5% | 87.5 words |
| Mistral 7B | 98.7% | 90.6% | 9.4% | 96.1 words |
| Google PaLM | 92.4% | 87.9% | 12.1% | 36.2 words |
| Google PaLM-Chat | 88.8% | 72.8% | 27.2% | 221.1 words |
With the expression "Average Summary Length", Vectara instead refers to the average length of the texts produced by each individual model.
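To make these metrics concrete, here is a short hypothetical example in Python. The counts are invented purely for illustration, and it assumes, as the table's values suggest, that Accuracy and Hallucination Rate are computed over the answers the model actually produced (the two always sum to 100%).

```python
# Hypothetical counts for a single model, used only to illustrate how the
# metrics in the table relate to one another. The numbers are made up.
questions_asked = 1000
answers_produced = 980          # the model declined 20 prompts
hallucinated_answers = 49       # summaries judged factually inconsistent

answer_rate = answers_produced / questions_asked
hallucination_rate = hallucinated_answers / answers_produced
accuracy = 1 - hallucination_rate

print(f"Answer Rate:        {answer_rate:.1%}")         # 98.0%
print(f"Accuracy:           {accuracy:.1%}")            # 95.0%
print(f"Hallucination Rate: {hallucination_rate:.1%}")  # 5.0%
```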
The real strength of generative models lies in the RAG approach
According to Vectara, however, the real power of modern language models lies, and will increasingly lie, in the so-called RAG (Retrieval Augmented Generation) approach. This is the ability of artificial intelligence systems to interact with external sources of knowledge in order to supplement and improve the internal representation of knowledge already contained in each LLM. According to Vectara's engineers, it is precisely the use of the RAG scheme that has the positive effect of reducing hallucinations.
The fundamental idea of RAG is to enrich the generative process by introducing a preliminary step that retrieves relevant data. This approach aims to improve the quality and consistency of the generated output, since the generative model is fed with specific, relevant information extracted through the initial retrieval operation.
RAG leverages the relevant information obtained from retrieval to guide and influence the generation process, providing context and support to the generative model.
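Here is a minimal, hypothetical sketch of the pattern in Python: a toy TF-IDF retriever selects the passages most relevant to the query and injects them into the prompt. The documents are placeholders and the `generate()` call stands in for whatever LLM API is being used; real RAG systems typically rely on dense embeddings and a vector database rather than TF-IDF.

```python
# Minimal sketch of the RAG pattern: retrieve relevant passages first,
# then hand them to the generative model as grounding context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Vectara's HEM scores the factual consistency of generated summaries.",
    "Retrieval Augmented Generation adds a retrieval step before generation.",
    "Overfitting can make a model reproduce its training data too closely.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

query = "How does RAG reduce hallucinations?"
context = "\n".join(retrieve(query, documents))

# The retrieved passages are injected into the prompt so the model answers
# from the supplied context instead of relying only on its internal knowledge.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# answer = generate(prompt)  # placeholder for any LLM API call
print(prompt)
```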
To produce the data in Vectara's table, republished just above, the company "fed" 1,000 short documents to the LLMs listed in the first column and then asked each of them to summarize every document using only the content that appears in the document itself. In 831 cases out of 1,000, all the language models created a high-quality, relevant and satisfactory summary. In the remaining cases things did not go so well, with hallucinations being generated.
The model proposed and used by Vectara is publicly available on Hugging Face: anyone can verify how HEM works and run tests independently.
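As a rough sketch of such a test, the snippet below assumes the model is published on Hugging Face under the ID vectara/hallucination_evaluation_model and can be loaded as a sentence-transformers CrossEncoder, as its model card indicated at the time of HEM's release; the loading interface and score semantics may have changed since, so check the current model card before relying on them.

```python
# Sketch of scoring a generated summary against its source document with
# Vectara's evaluation model. Per the original model card, a score near 1
# means the summary is factually consistent with the source, while a score
# near 0 suggests a hallucination.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The company reported revenue of 10 million euros in 2023."
summary = "The company tripled its revenue in 2023."

score = model.predict([[source, summary]])[0]
print(f"Factual consistency score: {score:.3f}")
```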
Opening image credit: iStock.com/da-kuk