GPT-4V: what it is and how it works according to those who have tried it in preview

In March 2023, OpenAI presented what had until now remained the latest publicly available version of its generative model: GPT-4. A couple of days ago, the company led by Sam Altman explained that ChatGPT is becoming capable of seeing, reading and listening. The big news is that the underlying model can now hold conversations with the user by voice and examine images provided as input, with the ability to generate images in turn and use them to enrich its answers. The basis of the new ChatGPT appears to be GPT-4V, where the final letter stands for “Vision”.

In fact, it is a new generative model that combines the traditional abilities of GPT-4 with multimodal capabilities. This means that applications built on top of GPT-4V can interact with the user through other, more advanced modes, integrated with one another. Artificial intelligence can thus “understand” and generate not only textual content but also work with other types of objects, such as images and sounds (the voice).

GPT-4V: OpenAI puts computer vision in the hands of any developer

As already done with GPT-4 and previous versions of the generative model, GPT-4V will also be usable via API, so it can be easily integrated into any type of application.

With GPT-4V, OpenAI effectively puts an advanced computer vision system in the hands of all interested developers, capable of recognizing the objects depicted in an image and making inferences about them. In other words, the system is able, for example, to examine in detail the structure of any photo or drawing and then carry out precise processing on the content provided as input.
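To give an idea of what such an integration might look like, here is a minimal sketch of an image question sent to a vision-capable GPT-4 model through OpenAI's chat completions endpoint using the official Python SDK. The model name, the image URL and the prompt are illustrative assumptions, not details confirmed by the article.

```python
# Minimal sketch, assuming GPT-4V is exposed through the standard chat
# completions endpoint; the model name "gpt-4-vision-preview" and the
# image URL are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the objects in this photo."},
                {
                    "type": "image_url",
                    # Placeholder image; any publicly reachable URL would do.
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The point of the single request is that text and image travel together in the same message, so the model can answer the question directly instead of relying on a separate image-classification step.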

The problems to manage

In a document published in recent days, OpenAI confirms some of the points we made in the article cited at the beginning. Referring to the GPT-4V model, the company confirms that it has so far been used mainly by some users of the Be My Eyes app, which assists blind and visually impaired people by helping them navigate their surroundings.

In an effort to mitigate potential issues, OpenAI has begun working with a number of “red teamers” to analyze the model and identify any unwanted behaviour. Red teamers are professionals or groups of experts who play a critical role during the testing and evaluation of systems, networks or security procedures. The activity is inspired by the military concept of “red teaming”, in which a team (the “red team”) simulates the role of an enemy or attacker to test an organization's resilience and defensive capabilities.

In OpenAI’s technical document, the company explains that it has adopted several security measures to avoid harmful uses of GPT-4V. For example, the model cannot crack CAPTCHAs, does not identify specific people or estimate their age or ethnicity, and should not draw conclusions based on information not present in an image. Furthermore, OpenAI says it has implemented measures to reduce bias within the model, especially regarding people’s physical appearance, gender and ethnicity.

The errors made by the model

Like all AI models, however, GPT-4V also makes mistakes: for example, it may combine multiple strings and invent a new term; it can suffer from hallucinations, inventing facts out of thin air while reporting them in an authoritative tone; and it may fail to recognize objects that should be trivial to detect, or certain places.

OpenAI currently prevents GPT-4V from being used to identify dangerous chemicals in images: the model proved unreliable in this context. In medical imaging, moreover, GPT-4V showed several uncertainties, at times answering incorrectly questions it had previously answered correctly.

How GPT-4V works: here is the first road test

Roboflow is a platform that facilitates data preparation for computer vision projects. It helps organize, label and prepare images so they can be used to train artificial intelligence models. It also generates labels, essential in the training phase; supports conversion between different data and image formats; and facilitates the integration of prepared data with various machine learning models, including those based on convolutional neural networks. Visit the project home page to see what Roboflow can do and which automations it helps implement.

Well, James Gallagher and Piotr Skalski say they got their hands on GPT-4V in advance and tested it quite thoroughly. The duo put GPT-4V to the test with a variety of questions, some of them rather clever.

For example, as the first input for GPT-4V they used a meme that combined technology-related terms with an image made of food. OpenAI’s model managed to describe correctly why the image was funny, referring to its various components and how they relate to each other. GPT-4V made only a small and, all in all, negligible error.

In the following tests, GPT-4V successfully recognized the denomination and type of a coin while also describing what was depicted on it. It also accurately identified Polish currency, calculating the total value of the coins placed on a table.

Recognizing movie images, cities, plants and solving mathematical problems

When sent a frame from the film Pulp Fiction, GPT-4V correctly recognized the movie and offered a high-level description. It then correctly recognized the name of a city from a panoramic photo, provided instructions for caring for a house plant, activated OCR functions when needed, and correctly solved a trigonometry problem from a photo taken with a smartphone.

Object detection

In the case of object recognition, GPT-4V was not only able to detect the objects but, upon request, also returned their coordinates within a photo.
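As a hedged illustration of the kind of request Roboflow describes (not their actual prompt), one could ask the model to return approximate bounding boxes as JSON and parse the reply. The prompt wording, the expected JSON schema and the model name are assumptions for the sketch.

```python
# Illustrative sketch: asking a vision-capable model for approximate
# bounding boxes as JSON. Prompt wording and JSON schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

prompt = (
    "List every object in the image and return JSON like "
    '[{"label": "dog", "box": [x_min, y_min, x_max, y_max]}] '
    "using pixel coordinates. Return only the JSON."
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.jpg"}},
        ],
    }],
    max_tokens=300,
)

# In practice the reply may need cleanup (e.g. stripping markdown fences)
# before it parses as JSON.
detections = json.loads(response.choices[0].message.content)
for det in detections:
    print(det["label"], det["box"])
```

Coordinates obtained this way are best treated as rough estimates rather than precise detections, which is consistent with the mixed results Roboflow reports below.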

Contrary to what OpenAI indicated, Roboflow reports that the model still tries to recognize the objects depicted in CAPTCHAs, albeit making some mistakes. For example, it was not able to establish the position of the classic traffic lights in a Google CAPTCHA.

GPT-4V also delivered mediocre performance when solving crosswords and Sudoku: its answers were incorrect.

Conclusions

GPT-4V performed well on various general image questions and showed context awareness in the vast majority of cases. Roboflow explains that, for answering general questions, GPT-4V can be genuinely exciting. While useful models for this purpose existed in the past, they often lacked fluency in their answers. GPT-4V can both answer questions and follow up on questions about an image, and do so in depth.

OpenAI’s new model allows you to ask questions about an image without building a two-step process: there is no need to classify an image first and then pass the result to a language model. There is no shortage of limitations, but the leap forward made with a tool like GPT-4V is truly remarkable.
