senselab.audio.tasks.speech_to_text

Speech to text

Task Overview

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is the process of converting spoken language into written text. This technology has a wide range of applications, including transcription services and voice user interfaces.

Notably, certain models can provide word- or sentence-level timestamps along with the transcribed text, making them well suited for generating subtitles. Additionally, some models are multilingual, and some incorporate language-identification components to improve performance.
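For instance, the Hugging Face transformers ASR pipeline (used here purely as an illustration, independent of senselab; the checkpoint and the audio file name are placeholders) can return such timestamps alongside the transcript:

```python
# Illustration only (not a senselab-specific recipe): transcribe a local file
# with the Hugging Face transformers ASR pipeline. "openai/whisper-tiny" and
# "sample.wav" are placeholder choices.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# For Whisper-style checkpoints, return_timestamps=True also yields
# segment-level timestamps, which is what makes subtitle generation easy.
result = asr("sample.wav", return_timestamps=True)
print(result["text"])
for chunk in result.get("chunks", []):
    print(chunk["timestamp"], chunk["text"])
```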

Models

A variety of models are supported by senselab for ASR tasks. They can be explored on the Hugging Face Hub. Each model varies in performance, size, license, language support, and more. Performance can also vary with the characteristics of the speakers in the processed audio, such as age, dialect, and disfluency patterns. It is recommended to review each model's card before use and to consult the most recent literature for an informed decision.

Popular models include:

  • Whisper (OpenAI)
  • wav2vec 2.0 (Meta)
  • HuBERT (Meta)
  • Massively Multilingual Speech (MMS, Meta)
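Within senselab itself, transcription is exposed through the transcribe_audios function of this module (see the import at the end of this page). The sketch below is indicative only: the import paths for the Audio and HFModel helpers, their constructors, and the keyword arguments are assumptions about senselab's conventions, so double-check them against the API reference.

```python
# Minimal sketch, NOT a verbatim senselab recipe: the Audio/HFModel import
# paths and the keyword names below are assumptions; consult the senselab
# API reference for the exact signature of transcribe_audios.
from senselab.audio.data_structures import Audio          # assumed path
from senselab.utils.data_structures import HFModel        # assumed path
from senselab.audio.tasks.speech_to_text import transcribe_audios

audio = Audio(filepath="sample.wav")                        # assumed constructor
model = HFModel(path_or_uri="openai/whisper-tiny")          # assumed constructor; placeholder checkpoint
transcripts = transcribe_audios(audios=[audio], model=model)  # assumed keyword arguments
print(transcripts[0])
```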

Evaluation

Metrics

The primary evaluation metric for ASR systems is the Word Error Rate (WER). WER is calculated as:

WER = (Substitutions + Insertions + Deletions) / Number of words in the reference

where:

  • Substitutions: reference words replaced by incorrect words.
  • Insertions: extra words that do not appear in the reference.
  • Deletions: reference words missing from the hypothesis.
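As a concrete check of the formula, the snippet below uses the third-party jiwer package (purely as an illustration, not necessarily what senselab uses internally) to score a hypothesis containing one substitution, one insertion, and one deletion against a six-word reference:

```python
# Worked WER example using jiwer (install with `pip install jiwer`).
import jiwer

reference = "the cat sat on the mat"    # 6 reference words
hypothesis = "the big cat sit on mat"   # "big" inserted, "sat"->"sit" substituted, one "the" deleted

# S=1, I=1, D=1, N=6  ->  WER = (1 + 1 + 1) / 6 = 0.5
print(jiwer.wer(reference, hypothesis))  # 0.5
```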

Other important metrics include:

  • Character Error Rate (CER)
  • Match Error Rate (MER)
  • Word Information Lost (WIL)
  • Word Information Preserved (WIP)

For detailed information on these metrics, refer to the speech to text evaluation module.
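For a quick hands-on feel, the same jiwer package (again an illustration, not a statement about senselab's internals) exposes these additional metrics directly:

```python
# Additional ASR metrics via jiwer, on the same toy example as above.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the big cat sit on mat"

print(jiwer.cer(reference, hypothesis))  # Character Error Rate
print(jiwer.mer(reference, hypothesis))  # Match Error Rate
print(jiwer.wil(reference, hypothesis))  # Word Information Lost
print(jiwer.wip(reference, hypothesis))  # Word Information Preserved
```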

Datasets (English Speech Benchmark - ESB)

The following table lists the datasets included in the English Speech Benchmark (ESB), which are generally used for evaluating ASR models in English:

| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|---|---|---|---|---|---|---|---|
| LibriSpeech | Audiobook | Narrated | 960 | 11 | 11 | Normalized | CC-BY-4.0 |
| Common Voice 9 | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| VoxPopuli | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| TED-LIUM | TED talks | Oratory | 454 | 2 | 3 | Normalized | CC-BY-NC-ND 3.0 |
| GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | Apache-2.0 |
| SPGISpeech | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
| Earnings-22 | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
| AMI | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |

For more details on these datasets and how models are evaluated to obtain the ESB score, refer to the ESB paper. Note that this list of datasets is not exhaustive. If you are interested in benchmarking models in different languages or under specific conditions, consult the relevant literature.
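As a sketch of what such an evaluation typically looks like in practice (a generic Hugging Face workflow, not a senselab-specific recipe; the tiny LibriSpeech excerpt and the checkpoint are illustrative choices):

```python
# Hedged sketch: score a model on a handful of utterances from a small
# LibriSpeech excerpt hosted on the Hugging Face Hub.
# Requires the `datasets`, `transformers`, and `jiwer` packages.
from datasets import load_dataset
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

references, hypotheses = [], []
for example in ds.select(range(5)):  # a few utterances, just to illustrate
    sample = {"array": example["audio"]["array"],
              "sampling_rate": example["audio"]["sampling_rate"]}
    references.append(example["text"].lower())
    hypotheses.append(asr(sample)["text"].lower())

# Real evaluations also normalize text (punctuation, casing, numbers)
# before scoring; this sketch only lowercases.
print(jiwer.wer(references, hypotheses))
```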

Benchmark

The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models available on the Hugging Face Hub. The leaderboard uses the datasets included in the ESB paper to obtain robust evaluation scores for each model. The ESB score is a macro-average of the WER scores across the ESB datasets, providing a comprehensive indication of a model's performance across various domains and conditions.
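Concretely, the aggregation is just an unweighted mean of the per-dataset WERs; the values below are made up purely to illustrate the computation:

```python
# Macro-averaged WER across ESB datasets (hypothetical values, not real leaderboard scores).
per_dataset_wer = {
    "librispeech": 0.04,
    "common_voice": 0.10,
    "voxpopuli": 0.08,
    "tedlium": 0.05,
    "gigaspeech": 0.11,
    "spgispeech": 0.06,
    "earnings22": 0.13,
    "ami": 0.16,
}

esb_score = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"ESB score (macro-averaged WER): {esb_score:.3f}")
```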

Notes

  • It is possible to fine-tune foundational speech models on a specific language without requiring large amounts of data. A detailed blog post on how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR can be found here.

Learn more:

1""".. include:: ./doc.md"""  # noqa: D415
2
3from .api import transcribe_audios  # noqa: F401