senselab.audio.tasks.speech_to_text
Speech to text
Task Overview
Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is the process of converting spoken language into written text. This technology has a wide range of applications, including transcription services and voice user interfaces.
Notably, certain models can provide word- or sentence-level timestamps along with the transcribed text, making them ideal for generating subtitles. Additionally, some models are multilingual, and some of them leverage language identification blocks to enhance performance.
Models
A variety of models are supported by senselab for ASR tasks. They can be explored on the Hugging Face Hub. Each model varies in performance, size, license, language support, and more. Performance may also vary depending on the speaker in the processed audio clips (e.g., differences in age, dialect, or disfluencies). It is recommended to review the model card for each model before use, and to consult the most recent literature for an informed decision.
Popular models include:
- Massively Multilingual Speech (MMS)
- Massively Multilingual and Multimodal Machine Translation (SeamlessM4T)
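As a quick illustration of how a Hub model of this kind can be run, the sketch below uses the generic Hugging Face `transformers` pipeline rather than senselab's own interface; the checkpoint name and audio path are placeholders only.

```python
from transformers import pipeline

# "openai/whisper-tiny" and "speech_sample.wav" are placeholders; any ASR
# checkpoint from the Hugging Face Hub and any local audio file can be used.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Whisper-style checkpoints can also return word-level timestamps,
# which is useful for subtitle generation.
result = asr("speech_sample.wav", return_timestamps="word")
print(result["text"])    # full transcription
print(result["chunks"])  # word-level timestamps
```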
Evaluation
Metrics
The primary evaluation metric for ASR systems is the Word Error Rate (WER). WER is calculated as:
WER = (Substitutions + Insertions + Deletions) / Number of words in the reference
where:
- Substitutions: reference words replaced with incorrect words.
- Insertions: extra words added that do not appear in the reference.
- Deletions: reference words that are omitted.
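To make the formula concrete, the sketch below computes WER directly from its definition using word-level edit distance; the function name and example strings are purely illustrative.

```python
# A minimal sketch of WER computed from word-level edit distance
# (dynamic programming over the reference and hypothesis word sequences).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution ("box") + 1 insertion ("jumps") over 4 reference words -> 0.5
print(word_error_rate("the quick brown fox", "the quick brown box jumps"))
```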
Other important metrics include:
- Character Error Rate (CER)
- Match Error Rate (MER)
- Word Information Lost (WIL)
- Word Information Preserved (WIP)
For detailed information on these metrics, refer to the speech to text evaluation module.
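One common way to compute these metrics in practice is the third-party `jiwer` package; it is shown here only as an illustration and is not necessarily what the senselab evaluation module uses internally.

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown box jumps over lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))  # Word Error Rate
print("CER:", jiwer.cer(reference, hypothesis))  # Character Error Rate
print("MER:", jiwer.mer(reference, hypothesis))  # Match Error Rate
print("WIL:", jiwer.wil(reference, hypothesis))  # Word Information Lost
print("WIP:", jiwer.wip(reference, hypothesis))  # Word Information Preserved
```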
Datasets (English Speech Benchmark - ESB)
The following table lists the datasets included in the English Speech Benchmark (ESB), which are generally used for evaluating ASR models in English:
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|---|---|---|---|---|---|---|---|
| LibriSpeech | Audiobook | Narrated | 960 | 11 | 11 | Normalized | CC-BY-4.0 |
| Common Voice 9 | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| VoxPopuli | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| TED-LIUM | TED talks | Oratory | 454 | 2 | 3 | Normalized | CC-BY-NC-ND 3.0 |
| GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | Apache-2.0 |
| SPGISpeech | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
| Earnings-22 | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
| AMI | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
For more details on these datasets and how models are evaluated to obtain the ESB score, refer to the ESB paper. Note that this list of datasets is not exhaustive. If you are interested in benchmarking models in different languages or under specific conditions, consult the relevant literature.
Benchmark
The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models available on the Hugging Face Hub. The leaderboard uses the datasets included in the ESB paper to obtain robust evaluation scores for each model. The ESB score is a macro-average of the WER scores across the ESB datasets, providing a comprehensive indication of a model's performance across various domains and conditions.
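As a sketch of what "macro-average" means here, the snippet below takes the unweighted mean of per-dataset WER values; the numbers are made up for illustration.

```python
# Macro-averaged WER: an unweighted mean over the per-dataset scores
# (values below are illustrative only, not real leaderboard results).
per_dataset_wer = {
    "librispeech": 0.042,
    "common_voice": 0.110,
    "voxpopuli": 0.087,
    # ... one entry per ESB dataset
}
esb_score = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"ESB score (macro-averaged WER): {esb_score:.3f}")
```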
Notes
- It is possible to fine-tune foundational speech models on a specific language without requiring large amounts of data. A detailed blog post on how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR can be found here.
Learn more: