Text to Speech
Task Overview
Text-to-speech (TTS) is the task of creating natural-sounding speech from text. This process can be performed in multiple languages and for multiple speakers.
Models
senselab supports a variety of models for text-to-speech.
Each model varies in performance, size, license, language support, and more. Performance may also vary with, among other factors, the length of the input text and the characteristics of the target speaker (age, dialect, disfluencies). It is recommended to review each model's card before use and to consult the most recent literature for an informed decision.
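Before choosing a model, it can help to see the overall flow. The snippet below is a minimal sketch that assumes senselab's `synthesize_texts` task and `HFModel` wrapper as exposed in recent releases; the exact import paths, signatures, and the fields of the returned `Audio` objects should be checked against the senselab API documentation.

```python
# A minimal sketch, assuming senselab exposes synthesize_texts and HFModel
# as in recent releases; check the senselab API docs for exact signatures.
from senselab.audio.tasks.text_to_speech import synthesize_texts
from senselab.utils.data_structures import HFModel  # import path assumed

model = HFModel(path_or_uri="suno/bark-small")  # any Hub TTS checkpoint
audios = synthesize_texts(texts=["Hello, welcome to senselab!"], model=model)

# Each result is assumed to be a senselab Audio with a waveform and rate.
print(audios[0].sampling_rate, audios[0].waveform.shape)
```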
Several text-to-speech models are currently available through 🤗 Transformers. These models can be explored on the Hugging Face Hub.
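For orientation, here is a hedged sketch of the 🤗 Transformers text-to-speech pipeline (an alias of its text-to-audio pipeline); `suno/bark-small` is just one example checkpoint, and the output format may vary slightly across models.

```python
# Hedged sketch of the 🤗 Transformers TTS pipeline; suno/bark-small is an
# example checkpoint, and output shapes can vary slightly between models.
from transformers import pipeline
import scipy.io.wavfile

tts = pipeline("text-to-speech", model="suno/bark-small")
result = tts("Hello, this is a test of text to speech.")

# The pipeline returns the raw waveform plus its sampling rate.
scipy.io.wavfile.write(
    "output.wav",
    rate=result["sampling_rate"],
    data=result["audio"].squeeze(),
)
```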
Note: Some Hugging Face models, despite carrying the text-to-speech label on their model cards, may not work with the text-to-speech pipeline. Such models are not supported in senselab, and identifying them often requires trial and error.
In addition to the models from 🤗 Transformers, senselab also supports Mars5-TTS and coqui-tts, which enable text-to-speech generation (sometimes using a specific target voice accompanied by its corresponding transcript). Voice cloning refers to creating a synthetic voice that mimics the characteristics of a specific person's voice, known as the target voice: the generated speech sounds as if it were spoken by that person, even though it was produced by a machine.
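As an illustration, below is a hedged sketch of voice cloning with coqui-tts, using its documented Python API and the XTTS v2 checkpoint; verify the model name and arguments against the current coqui-tts documentation.

```python
# Hedged sketch of voice cloning with coqui-tts (XTTS v2 checkpoint);
# verify model names and arguments against the coqui-tts docs.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This sentence will be spoken in the target voice.",
    speaker_wav="target_voice.wav",  # a short clip of the target speaker
    language="en",
    file_path="cloned.wav",
)
```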
Popular and recommended options include the Mars5-TTS and coqui-tts models mentioned above, as well as the text-to-speech checkpoints hosted on the Hugging Face Hub.
Evaluation
Metrics
For assessing speech quality and intelligibility, we can use quantitative metrics such as:
- Perceptual Evaluation of Speech Quality (PESQ), typically in its wideband variant
- Short-Time Objective Intelligibility (STOI)
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
and qualitative metrics such as:
- Mean Opinion Score (MOS)
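As a hedged sketch, the quantitative metrics above can be computed with torchmetrics (which wraps the pesq and pystoi packages). Note that PESQ, STOI, and SI-SDR are intrusive metrics, i.e., they need a clean, time-aligned reference recording; the random tensors below are placeholders for real speech loaded from disk.

```python
# Hedged sketch of PESQ/STOI/SI-SDR with torchmetrics
# (pip install torchmetrics pesq pystoi).
import torch
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ShortTimeObjectiveIntelligibility,
    ScaleInvariantSignalDistortionRatio,
)

fs = 16_000
reference = torch.randn(fs)                     # placeholder: clean reference speech
generated = reference + 0.05 * torch.randn(fs)  # placeholder: synthesized speech

pesq = PerceptualEvaluationSpeechQuality(fs, "wb")  # wideband PESQ
stoi = ShortTimeObjectiveIntelligibility(fs)
si_sdr = ScaleInvariantSignalDistortionRatio()

print(pesq(generated, reference))
print(stoi(generated, reference))
print(si_sdr(generated, reference))
```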
Another way to automatically assess intelligibility is to transcribe the synthesized audio with an ASR system (trusting its transcription) and compute the Word Error Rate (WER) against the reference text.
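A minimal sketch of this, assuming an ASR pipeline from 🤗 Transformers and the jiwer package for WER; the Whisper checkpoint, audio file, and reference string are placeholders.

```python
# Hedged sketch: transcribe synthesized audio with ASR, then score WER.
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
hypothesis = asr("synthesized.wav")["text"]  # placeholder file name

reference = "Hello, this is a test of text to speech."
print("WER:", jiwer.wer(reference.lower(), hypothesis.lower()))
```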
Also, if targeting a specific speaker's voice, we can perform speaker verification to assess how closely the generated audio matches the target voice. If there are specific features of the target voice that we aim to preserve, we can extract them from the generated audio and verify their presence.
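One hedged way to do this is with SpeechBrain's pretrained ECAPA-TDNN verification model; the import path below follows SpeechBrain 1.x (older releases expose it under `speechbrain.pretrained`), and the audio file names are placeholders.

```python
# Hedged sketch of speaker verification with SpeechBrain's ECAPA-TDNN model;
# in older SpeechBrain releases, import from speechbrain.pretrained instead.
from speechbrain.inference.speaker import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
score, same_speaker = verifier.verify_files("target_voice.wav", "cloned.wav")
print(f"similarity={score.item():.3f}, same speaker: {bool(same_speaker)}")
```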
senselab can help with all of these evaluations.
Datasets
To train and evaluate TTS models, a variety of datasets can be used. Some popular datasets include:
- LJSpeech: A dataset of single-speaker English speech.
- LibriTTS: A multi-speaker English dataset derived from the LibriVox project.
- VCTK: A multi-speaker English dataset with various accents.
- Common Voice: A multi-language dataset collected by Mozilla.
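Several of these corpora are mirrored on the Hugging Face Hub and can be pulled with the datasets library. Below is a hedged sketch for LJSpeech; dataset ids and hosting change over time (e.g., the mirror may live under `keithito/lj_speech`), so check the Hub before relying on a specific id.

```python
# Hedged sketch: loading LJSpeech from the Hugging Face Hub; the dataset id
# is an assumption and may differ (e.g., keithito/lj_speech).
from datasets import load_dataset

ljspeech = load_dataset("lj_speech", split="train")
sample = ljspeech[0]
print(sample["text"])
print(sample["audio"]["sampling_rate"], len(sample["audio"]["array"]))
```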
Benchmark
The TTS Arena ranks and evaluates available text-to-speech models based on listeners' preferences. For automated benchmarking, we recommend using the standard datasets and metrics mentioned above.