Google Magika recognizes files in the blink of an eye, thanks to AI

Since the dawn of computing, accurately detecting the type of the files one has to deal with has been crucial in many contexts. Linux ships with libmagic, a library for file type identification that is often used to determine a file’s content type from its structure and its first few bytes. In essence, libmagic recognizes the format of a file by analyzing its initial bytes, that is, its header.
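The idea of recognizing a format from its initial bytes can be sketched in a few lines of Python. This is a minimal illustration, not libmagic’s actual rule set: the SIGNATURES table and the sniff_type function are hypothetical names, and the table lists only a handful of well-known signatures.

```python
# Illustrative signature table: (leading bytes, label).
# Assumption: a tiny hand-picked list; real libmagic rules are far more
# numerous and can also test bytes at offsets beyond the start of the file.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also the container used by .docx, .xlsx, .jar
    (b"\x7fELF", "elf"),
]


def sniff_type(data: bytes) -> str:
    """Return the label of the first signature that prefixes `data`."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

Because only the header is inspected, the check is fast, but it says nothing about formats that have no fixed signature, such as most plain-text files.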

When you need to recognize the type of a file, it is never a good idea to stop at its extension: the presence of .pdf, .docx, .txt, and so on may not reflect what the file actually contains. Before concluding that a file is damaged, it is best to check its format with the utmost care.
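As a rough sketch of such a check, the snippet below compares a file name’s extension with the bytes the file actually starts with. The EXPECTED_MAGIC mapping and the extension_matches_content function are hypothetical and cover only a few formats, so a real tool would need a much larger table.

```python
from pathlib import Path
from typing import Optional

# Assumption: a deliberately small mapping from extension to the leading
# bytes that format is expected to start with.
EXPECTED_MAGIC = {
    ".pdf": b"%PDF-",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".docx": b"PK\x03\x04",  # .docx files are ZIP containers
    ".zip": b"PK\x03\x04",
}


def extension_matches_content(filename: str, head: bytes) -> Optional[bool]:
    """Compare a file's extension with its first bytes.

    Returns True or False when a rule exists for the extension,
    and None when the extension is not in the table.
    """
    expected = EXPECTED_MAGIC.get(Path(filename).suffix.lower())
    if expected is None:
        return None
    return head.startswith(expected)
```

A file named invoice.pdf whose content begins with the bytes MZ, for instance, would yield False, a hint that the extension does not describe the content.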

The problem of identifying the type of each file

Operating systems, web browsers, code editors, and countless other programs rely on file type detection to decide how to handle an object. Modern code editors, for example, use file type detection to choose which syntax highlighting scheme to activate, since the rules change depending on the programming language.

Until now, libmagic and most other file type identification software have relied on a collection of heuristic rules and custom solutions to detect each file format. This approach is expensive and error-prone, because it is hard to write rules that hold across a wide range of situations.
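To see why hand-written rules are fragile, consider the following sketch of a naive classifier for text content. The rules in guess_text_format are invented for illustration and are not taken from libmagic or any other tool.

```python
import json


def guess_text_format(text: str) -> str:
    """Classify text with a few hand-written rules (illustrative only)."""
    stripped = text.lstrip()
    if stripped.startswith("#!"):
        return "script"
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; try the next rules
    if stripped.startswith("<?xml"):
        return "xml"
    # Keyword rule: cheap to write, easy to fool.
    if "def " in text or "import " in text:
        return "python"
    return "text"


# The keyword rule misfires on ordinary prose:
print(guess_text_format("Please import the attached file."))  # python
# An INI file starts with "[" but is not JSON, so it falls through:
print(guess_text_format("[section]\nkey=value"))  # text
```

Each misclassification can be patched with one more rule, but every patch adds maintenance cost and new edge cases, which is the problem described above.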

Google Magika, the file detector powered by artificial intelligence

To address the problem and provide quick and accurate identification of all kinds of files, Google has developed Magika. A demonstration page, specially prepared to show how the new tool works, makes Magika’s abilities clear.

The system designed and built by the engineers of the Mountain View company recognizes binary and textual file types within a handful of milliseconds, even when run locally on a “simple” CPU, with no need for a GPU.

Magika uses a custom, highly optimized deep learning model, trained using Keras, that weighs only about 1 MB. For inference (the process of applying an already trained model to make predictions or classifications on new data), Magika relies on ONNX (Open Neural Network Exchange), a file format and ecosystem introduced by Microsoft and Facebook in 2017 that lets you build, train, and deploy artificial intelligence (AI) models interoperably across different frameworks and platforms.

We talked about ONNX in the article on how to remove the background from an image using an open source library and artificial intelligence.

How to try Magika, from the Web or using Python code

You can try the Magika web demo today or install Magika as a Python library, which also lets you use the tool from the command line. Just run the command pip install magika.
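Once the package is installed, a minimal use from Python might look like the sketch below. It assumes the Magika class and its identify_bytes method from the published package. The helper names get_detector and detect_mime_type are hypothetical, and the fields of the result object have changed between releases, so check the documentation of the version you install.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_detector():
    """Create the Magika detector once and reuse it across calls."""
    # Third-party dependency: pip install magika
    from magika import Magika

    return Magika()


def detect_mime_type(data: bytes) -> str:
    """Return the MIME type Magika predicts for an in-memory payload."""
    result = get_detector().identify_bytes(data)
    # Assumption: `output.mime_type` is available in the installed version;
    # other fields, such as the content-type label, were renamed over time.
    return result.output.mime_type
```

Calling detect_mime_type with the raw bytes of a file returns a MIME type string. The package also provides identify_path for files stored on disk.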

From a performance perspective, Magika, thanks to its AI model and large training dataset, outperforms other existing tools by around 20%. This figure, shared by Google, comes from tests conducted on a total of 1 million files covering over 100 file types.

For recognizing a file and tracing its type when the extension is missing, we have previously introduced the TrID utility. According to the table published by Google, Magika outperforms all “competitors” in correct file detection, including TrID. The gap becomes more pronounced with text files, programming code, and configuration files, all of which other tools can struggle with.

File type recognition with Google Magika

What Google Magika is and will be used for

Google is already using Magika on a large scale to improve user security: to detect the content of mail attachments in Gmail and of files published on Drive, and to improve the behavior of Safe Browsing at the web browser level.

Magika will soon also be integrated with VirusTotal, complementing the platform’s Code Insight feature, which uses Google’s generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are scanned by Code Insight, improving the efficiency and accuracy of the responses produced by the platform.

However, since it is an open source product distributed under the Apache 2.0 license (the source code is published on GitHub), Magika can become a commonly used tool, one that can be integrated into other third-party software and platforms to improve the accuracy of file identification.

Opening image credit: iStock.com – Royalty Free
