With the PaliGemma VLM, your computer understands the content of the images

Applications of artificial vision, or computer vision, are certainly nothing new. With the advent of increasingly capable generative models based on artificial intelligence, however, scenarios that were unthinkable only a short time ago are gradually opening up. PaliGemma is a new family of VLMs (Vision Language Models) just presented by Google. These models receive as input an image accompanied by a request in natural language and produce, in turn, a response in textual format.

What is a VLM and how Google PaliGemma works

A VLM is a type of AI-based model designed to understand and manipulate both visual and linguistic information simultaneously. It can analyze images and text in an integrated way, allowing it to answer questions about the content of images, generate automatic descriptions for images, and perform a wide range of other tasks that involve "understanding" and using both modes of communication.

Tools like VLMs find wide application in several industries, including healthcare, industrial automation, computer vision, robotics, and much more. This opens up extensive possibilities for interaction between machines and the surrounding world, accelerating innovation and allowing organizations to seize the new opportunities offered by artificial intelligence.

PaliGemma emerges as a pioneering tool in the field of VLMs, offering a deep "understanding" of both images and text.

How PaliGemma works

The architecture of PaliGemma

A tool like PaliGemma represents the convergence of language processing and computer vision in the domain of artificial intelligence. At its heart lies a sophisticated architecture capable of interpreting both images and text simultaneously.

The PT, Mix, and FT checkpoints are categories of models within the PaliGemma family: each is designed for specific purposes and has distinctive characteristics. The PT checkpoints, for example, are pre-trained on a large and generally diverse dataset. They are designed as a starting point for further training on more targeted tasks.

PT checkpoints are flexible and can be adapted to a wide range of tasks through fine-tuning, where the model is further trained on task-specific data for image captioning, visual question answering, object recognition, and so on.

Mix checkpoints are trained on a wide range of data and tasks, making them more versatile and suitable for research and experimental applications. They are particularly suitable for general use and interaction via free-text prompts, allowing users to explore the model's capabilities on a variety of tasks without having to perform preliminary fine-tuning.

FT checkpoints, finally, are models that have already undergone advanced fine-tuning on specific tasks. They are optimized to achieve high performance on academic benchmarks and are therefore mainly intended for research purposes and applications in the university field.

For more information, you can refer to the introductory article published on the Hugging Face blog.

The capabilities of the model

We have already said that PaliGemma can tackle a wide range of different tasks, depending on the instructions provided by the user. When using a PaliGemma model, you configure the type of task you want to perform through the prompt itself; detection and segmentation activities, for example, can be carried out this way.
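The task is selected by the prefix of the prompt string. A few illustrative prompt compositions follow; the prefix keywords ("caption", "detect", "segment", "answer", "ocr") reflect the conventions described in the model's documentation, but treat them as examples rather than an exhaustive or guaranteed list, and the helper function is purely hypothetical.

```python
# Illustrative PaliGemma-style prompt strings. The prefixes below are
# assumptions based on the published documentation, not an official API.

def build_prompt(task: str, argument: str = "") -> str:
    """Compose a PaliGemma-style prompt from a task prefix and an argument."""
    return f"{task} {argument}".strip()

caption_prompt = build_prompt("caption en")                 # describe the image in English
detect_prompt = build_prompt("detect", "car")               # locate every car
segment_prompt = build_prompt("segment", "person")          # segment every person
vqa_prompt = build_prompt("answer en", "who is depicted in the mural?")
ocr_prompt = build_prompt("ocr")                            # read the text in the image

print(detect_prompt)  # detect car
```

The resulting string is then passed to the model together with the image, exactly as a free-text question would be.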

Object detection is a fundamental operation in computer vision that consists in identifying the presence and position of objects of interest in an image. Entity segmentation, in turn, is a technique used to identify and separate distinct objects within an image. It is particularly useful when you want a detailed understanding of the various entities present in an image and need to isolate specific objects for further analysis and processing.

PaliGemma excels at generating descriptive captions for images, providing context and valuable information. As highlighted previously, by combining the "understanding" of images with the "understanding" of text, the model can effectively answer questions about the content of images.

Finally, in addition to the aforementioned segmentation abilities, PaliGemma can also work on the content of documents by exploiting optical character recognition (OCR).

How to use PaliGemma and unlock computer vision abilities on any device

You can try PaliGemma quickly and easily by visiting the demo on Hugging Face. The demo application accepts images selected directly from the user's device. Alternatively, you can refer to the examples at the bottom of the page: next to each of them is the prompt used by the PaliGemma authors to carry out, for example, detection and segmentation tasks.

It is not strictly necessary to use English to interface with PaliGemma, but "magic prefixes" such as detect and segment prove crucial, for example, to detect objects and obtain a response containing the position of each of them.
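According to the published model documentation, a detect response encodes each bounding box as four <locNNNN> tokens, with values on a 1024-bin grid and, per the reference implementation, in y_min, x_min, y_max, x_max order, followed by the object label (multiple detections separated by ";"). A minimal parser sketch under those assumptions:

```python
import re

# Parse a PaliGemma "detect" answer into pixel-space bounding boxes.
# Assumptions (from the published format): four <locNNNN> tokens per box,
# values normalized to a 1024-bin grid, order y_min, x_min, y_max, x_max.
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([\w\s]+?)(?:\s*;\s*|$)")

def parse_detection(answer: str, width: int, height: int):
    """Return a list of (label, (x0, y0, x1, y1)) boxes in pixels."""
    boxes = []
    for locs, label in LOC_PATTERN.findall(answer):
        vals = [int(v) for v in re.findall(r"<loc(\d{4})>", locs)]
        y0, x0, y1, x1 = [v / 1024 for v in vals]
        boxes.append((label.strip(),
                      (round(x0 * width), round(y0 * height),
                       round(x1 * width), round(y1 * height))))
    return boxes

raw = "<loc0102><loc0204><loc0818><loc0920> car"
print(parse_detection(raw, width=1024, height=1024))
# [('car', (204, 102, 920, 818))]
```

Scaling by the actual image width and height converts the grid coordinates into pixels ready for drawing or cropping.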

Object recognition in images

As we can see, PaliGemma is comfortable with slightly more complex questions. Submitting an image like the one in the figure and asking "who is depicted in the mural", the Google model immediately responds "David Bowie", without any uncertainty.

Prompt submitted to the PaliGemma VLM

Through the Model and Decoding drop-down menus, you can choose different options. The "paligemma-3b-mix-224" and "paligemma-3b-mix-448" entries refer to PaliGemma models pre-trained on a mix of tasks at specific image sizes. The former is optimized for processing smaller images and is designed to be lighter in terms of memory requirements. The latter supports higher resolutions and may prove more suitable for applications that require examining finer details.
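The two mix checkpoints differ essentially in input resolution (224x224 vs 448x448 pixels). A tiny illustrative helper (not part of any official API) that picks a checkpoint id based on whether fine detail matters:

```python
# Illustrative mapping between input resolution and the published
# mix-checkpoint identifiers; the helper function is hypothetical.
MIX_CHECKPOINTS = {
    224: "google/paligemma-3b-mix-224",  # lighter, lower memory use
    448: "google/paligemma-3b-mix-448",  # higher resolution, finer detail
}

def pick_checkpoint(need_fine_detail: bool) -> str:
    """Choose the higher-resolution checkpoint only when detail matters."""
    return MIX_CHECKPOINTS[448 if need_fine_detail else 224]

print(pick_checkpoint(False))  # google/paligemma-3b-mix-224
```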

As for the decoding options (greedy, nucleus(0.1), nucleus(0.3), temperature(0.5)), these concern how the model generates text. The greedy choice means the model always selects the most likely word as the next token during text generation. It is a simple decoding strategy but can lead to less diverse results; we discuss greedy selection in the article on how LLMs work.

Conversely, nucleus sampling, also known as top-p sampling, samples from the smallest set of tokens whose cumulative probability exceeds a predetermined threshold. This approach allows for greater diversity in results than greedy decoding. Finally, a higher temperature introduces greater creativity into the model's behavior and a higher degree of randomness in the final output.
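The three strategies can be sketched on a single, invented next-token distribution; this toy code mirrors the definitions above and is not the model's actual decoder:

```python
import random

# Toy next-token distribution (invented values, for illustration only).
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "that": 0.05}

def greedy(dist):
    """Greedy decoding: always pick the single most likely token."""
    return max(dist, key=dist.get)

def nucleus(dist, p, rng):
    """Top-p sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, then sample among them proportionally."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights)[0]

def with_temperature(dist, t):
    """Rescale a distribution: t > 1 flattens it (more randomness),
    t < 1 sharpens it (closer to greedy)."""
    scaled = {tok: prob ** (1.0 / t) for tok, prob in dist.items()}
    z = sum(scaled.values())
    return {tok: v / z for tok, v in scaled.items()}

rng = random.Random(0)
print(greedy(probs))                 # the
print(nucleus(probs, 0.1, rng))      # with p=0.1 only "the" survives -> the
print(nucleus(probs, 0.9, rng))      # "the", "a" and "this" are all candidates
```

With a high temperature the rarest token gains probability mass, which is exactly the "greater creativity" effect described above.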

PaliGemma can also be used in your own applications: how to do it with Python code

Using the instructions published in the How to run inference section, you can exploit PaliGemma models within your own applications using, for example, simple Python code. Specifically, the PaliGemmaForConditionalGeneration class allows you to interact with the models and get responses by sending them the image and the prompt to be processed.

Below is an example of Python code that uses a PaliGemma model to generate text starting from a text prompt and an image. Before proceeding, make sure you have installed the necessary libraries, such as transformers and torch.

import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

# Load the model and processor
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Function to generate text from a prompt and an image
def generate_text(prompt, image_url, max_tokens=50):
    # Download and pre-process the image
    image = Image.open(requests.get(image_url, stream=True).raw)
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # Generate text
    output = model.generate(**inputs, max_new_tokens=max_tokens)

    # Decode the text
    generated_text = processor.decode(output[0], skip_special_tokens=True)

    return generated_text

# Prompt and example image URL
prompt = "Describe what is depicted in this image"
image_url = ""

# Generate text from the prompt and image
generated_text = generate_text(prompt, image_url)

# Display the generated text
print("Generated text:", generated_text)

Instead of the URL indicated, you must obviously specify the address of the image you wish to process. The Python code…
