Whisper: The model transcribes 2.5 hours of speech in a minute and a half

Whisper Large v3 by OpenAI is an automatic speech recognition (ASR) model trained on millions of hours of labeled speech. The model, released by Sam Altman’s company, is designed to recognize, transcribe and translate spoken language, combining high accuracy with ease of use.

The community has embraced Whisper: many developers and companies use it to transcribe audio and video tracks (for example meetings and interviews) and to power a wide range of applications. Whisper Large v3 is supported in the Hugging Face Transformers library, which makes it even more accessible and easy to deploy.

The Insanely Fast Whisper project: how it works

Insanely Fast Whisper, a new project hosted on GitHub, can transcribe 150 minutes of audio in less than 100 seconds. The benchmarks run on a Google Colab T4 GPU speak for themselves: the standard Transformers implementation takes over 31 minutes to transcribe the 150 minutes of recorded speech, whereas Insanely Fast Whisper achieves the same result in approximately 1 minute and 38 seconds.
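To put those figures in perspective, the short calculation below turns them into a "real-time factor" (seconds of audio transcribed per second of waiting) and a speedup. It uses only the numbers quoted above; since the baseline is reported as "over 31 minutes", the speedup is a lower bound.

```python
# Figures quoted in the article (Google Colab T4 benchmark).
AUDIO_SECONDS = 150 * 60        # 150 minutes of speech
BASELINE_SECONDS = 31 * 60      # "over 31 minutes" with standard Transformers
FAST_SECONDS = 1 * 60 + 38      # about 1 minute 38 seconds with Insanely Fast Whisper


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


baseline_rtf = realtime_factor(AUDIO_SECONDS, BASELINE_SECONDS)
fast_rtf = realtime_factor(AUDIO_SECONDS, FAST_SECONDS)
speedup = BASELINE_SECONDS / FAST_SECONDS

print(f"baseline:  {baseline_rtf:.1f}x real time")
print(f"optimized: {fast_rtf:.1f}x real time")
print(f"speedup:   at least {speedup:.0f}x")
```

In other words, the baseline runs at roughly 5 times real time, the optimized setup at roughly 92 times, for a speedup of about 19x.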

Behind this remarkable performance are a few “winning” optimizations, such as the use of flash-attn (Flash Attention 2) and batch processing.

In automatic audio transcription, “attention” is a crucial mechanism that determines which parts of the input the model should focus on when generating text. The standard implementation compares every input element with every other one and writes the full table of scores to GPU memory, which slows the process down when the input sequence is particularly long. Think for example of audio files of significant length. Flash Attention 2 does not change the result of this computation: it reorganizes it into small blocks that fit in the GPU's fast on-chip memory, so the full table never has to be written out and read back. The output is the same, but the time and memory needed to produce it drop significantly.
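As a rough illustration of the idea behind Flash Attention, the sketch below computes attention for a single query in two ways: the textbook way, which needs all scores at once, and a block-by-block way that keeps only a running maximum, a running sum and a running output. This is plain Python for readability, not the actual GPU kernel; the function names and the toy data are made up for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Textbook attention for one query: softmax(q.k / sqrt(d)) applied to values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, computed block by block with a running ("online") softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim_v = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim_v
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim_v):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy data: one query, five key/value pairs.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]

full = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
assert all(abs(a - b) < 1e-9 for a, b in zip(full, tiled))
```

The two functions agree up to floating-point rounding, which is the point: the blocked version never holds more than `block_size` scores at a time, yet loses nothing in accuracy.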

Batch processing is another key technique in optimizing the performance of Insanely Fast Whisper. Instead of processing a single item at a time, the model processes a group (or batch) of items simultaneously. This strategy lets you benefit from the parallelism of modern graphics processors (GPUs), use memory efficiently and reduce overhead.
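A simplified sketch of what batching means for a long recording: the audio is cut into fixed-length chunks, and the chunks are handed to the model in groups rather than one by one. The 30-second chunk length matches Whisper's input window; the batch size of 24 is an assumption chosen for illustration, and the real pipeline also overlaps neighboring chunks, which is left out here.

```python
def split_into_chunks(total_seconds: float, chunk_seconds: float = 30.0):
    """Return (start, end) windows, in seconds, covering the whole recording."""
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def group_into_batches(items, batch_size: int = 24):
    """Group items into lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


chunks = split_into_chunks(150 * 60)      # the 150-minute benchmark recording
batches = group_into_batches(chunks)      # assumed batch size: 24

print(len(chunks), "chunks ->", len(batches), "batches")
# prints: 300 chunks -> 13 batches
```

With these assumed settings the GPU is invoked 13 times instead of 300, each time with enough work to keep its parallel units busy.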

How to use Insanely Fast Whisper from the terminal window

To experience the performance of Insanely Fast Whisper firsthand, you can use the command line interface (CLI). Installation is very simple and boils down to the following instruction:

pipx install insanely-fast-whisper

The analysis of the audio file and the resulting text transcription can then be started with the following command:

insanely-fast-whisper --file-name <filename or URL>

The GitHub repository also provides a Python code example that lets you start processing the input audio file from within an application. Using Whisper directly from Python opens up a wide range of customizations and makes it easy to integrate into your own software projects.
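A minimal sketch of what such a script might look like, using the Hugging Face Transformers `pipeline` API. It is not a copy of the repository's example: the helper names `build_model_kwargs` and `transcribe`, the batch size of 24 and the fallback to the `sdpa` attention implementation are assumptions made for illustration. Running it requires `torch`, `transformers`, an Nvidia GPU and, for Flash Attention 2, the `flash-attn` package.

```python
import sys


def build_model_kwargs(flash_attention_available: bool) -> dict:
    """Choose the attention implementation handed to the model."""
    implementation = "flash_attention_2" if flash_attention_available else "sdpa"
    return {"attn_implementation": implementation}


def transcribe(path_or_url: str, batch_size: int = 24) -> dict:
    """Transcribe one audio file with Whisper Large v3 on the first CUDA GPU."""
    # Heavy imports are kept local so the module loads even without a GPU stack.
    import torch
    from transformers import pipeline
    from transformers.utils import is_flash_attn_2_available

    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,   # half precision to save GPU memory
        device="cuda:0",
        model_kwargs=build_model_kwargs(is_flash_attn_2_available()),
    )
    # Long audio is split into 30-second chunks that are transcribed in batches.
    return pipe(
        path_or_url,
        chunk_length_s=30,
        batch_size=batch_size,
        return_timestamps=True,
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    result = transcribe(sys.argv[1])
    print(result["text"])
```

Saved as, say, `transcribe.py`, it would be run as `python transcribe.py <filename or URL>`. If the GPU runs out of memory, lowering `batch_size` is the first thing to try.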

Insanely Fast Whisper is designed to make the most of Nvidia GPUs, ensuring optimal performance on dedicated hardware. At the time of writing, the CLI is only compatible with Nvidia GPUs, but the development team is working to extend support to other architectures.

The future of audio transcription

Insanely Fast Whisper opens the door to new experiments in audio transcription. With such fast response times, one can imagine real-time applications in areas such as machine translation, live subtitling and much more.

The combination of powerful Transformer models, intelligent optimizations and ease of use positions Insanely Fast Whisper as a reference tool for professionals, researchers and technology enthusiasts.

