This is a tutorial on transcribing audio files using DeepSpeech (free) or Google Cloud Speech-to-Text (paid).

Speech recognition engines have improved to the point where reasonably good automatic transcriptions can be produced for free, instead of paying for manual transcription. A human annotator can then edit these transcriptions if required, which still reduces costs.

Mozilla, the open-source community that created Firefox, has developed DeepSpeech, an open-source speech-to-text engine based on Baidu’s Deep Speech model. They trained it on the Common Voice corpora, which are collected from the community across many languages (you can quickly contribute here). Using the DeepSpeech 0.5.0 model provided below, they achieve an 8.22% word error rate on the LibriSpeech clean test corpus.

Below is a fast setup for using DeepSpeech and the paid alternative, Google Cloud Speech-to-Text (with a 300-minute free trial). I have found that they perform similarly on certain audio files but that Google’s engine works considerably better on others (at least for now), although I did not perform any systematic or quantitative comparison. It’s a matter of trying both on your audio files and perhaps preprocessing the files in different ways to try to improve performance.


Install python3

To clone the repo with code found here, open the terminal and run these commands:

git clone
cd transcribe_deepspeech
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt

Download the pre-trained DeepSpeech model into the ./data/ directory (1.9 GB). If you download it into another directory, change the path in the last parameter. New model paths can be obtained here.

wget -P data/
tar xvfz data/deepspeech-0.5.0-models.tar.gz -C data/ # Extracts the archive into data/
rm data/deepspeech-0.5.0-models.tar.gz

This will create the ./data/deepspeech-0.5.0-models/ directory.

Choose whether to run DeepSpeech, Google Cloud Speech-to-Text, or both by setting parameters in

To use the Google Cloud API, obtain credentials here (1-year $300 free credit). More info here.
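As a rough sketch of what "obtaining credentials" leads to in practice: Google Cloud client libraries look for a service-account key file through the GOOGLE_APPLICATION_CREDENTIALS environment variable. The helper name `use_credentials` and the key file name in the usage note are hypothetical and not taken from the repo, which may handle this differently.

```python
import os
from pathlib import Path

# Environment variable that Google Cloud client libraries read to locate
# a service-account key file.
CREDENTIALS_ENV = "GOOGLE_APPLICATION_CREDENTIALS"


def use_credentials(key_file: str) -> str:
    """Point Google client libraries at key_file (hypothetical helper).

    Raises FileNotFoundError when the key file is missing, so the problem
    surfaces before any audio is sent to the API.
    """
    path = Path(key_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    os.environ[CREDENTIALS_ENV] = str(path)
    return str(path)
```

For example, `use_credentials("~/my-project-key.json")` (an illustrative file name) would be called once, before creating any Google client.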

Set paths in

You must create the data directory and the input directory containing your audio files yourself and set their paths. The code will create all other directories that do not already exist.
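The directory handling described here can be sketched as follows. This is a minimal illustration, not the repo's actual code: the function name `prepare_dirs` and the directory names in the usage note are assumptions.

```python
from pathlib import Path
from typing import Iterable


def prepare_dirs(input_dir: str, output_dirs: Iterable[str]) -> None:
    """Require input_dir to exist, then create each missing output dir."""
    if not Path(input_dir).is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    for d in output_dirs:
        # parents=True builds intermediate folders; exist_ok=True leaves
        # directories that are already there untouched.
        Path(d).mkdir(parents=True, exist_ok=True)
```

A call such as `prepare_dirs("data/input", ["data/wav_dir", "data/output"])` (illustrative paths) fails early if the audio directory is missing and is safe to run repeatedly.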

It will convert all audio files to 16 kHz, 16-bit WAV files, as required by DeepSpeech, and write them to a temporary wav_dir.
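If you want to confirm that a converted file really has this format, a small check with Python's standard `wave` module could look like the sketch below. The function name `is_deepspeech_ready` is hypothetical, and the mono requirement is my assumption; the text above only states the sample rate and bit depth.

```python
import wave

TARGET_RATE = 16000   # 16 kHz sample rate
TARGET_WIDTH = 2      # bytes per sample, i.e. 16 bit
TARGET_CHANNELS = 1   # mono: an assumption, not stated in the text


def is_deepspeech_ready(wav_path: str) -> bool:
    """Return True if the WAV header matches the target format."""
    with wave.open(wav_path, "rb") as wav:
        return (wav.getframerate() == TARGET_RATE
                and wav.getsampwidth() == TARGET_WIDTH
                and wav.getnchannels() == TARGET_CHANNELS)
```

This only inspects the header, so it is cheap to run over every file in wav_dir before starting a long transcription job.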

Finally, run:

More info:

Ground-truth transcriptions of the example audio files

8455-210777-0068.wav: “Your power is sufficient I said”

4507-16021-0012.wav: “Why should one halt on the way?”

2830-3980-0043.wav: “Experience proves this”

DeepSpeech transcription

your power is sufficient i said

why should one halt on the way

*experience proof this

Compare to Google Cloud Speech-to-Text transcription

your power is sufficient I said

why should one halt on the way

experience proves this
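To put a number on comparisons like these, you can compute the word error rate (WER) of each transcription against the ground truth. The sketch below is a plain word-level edit distance written for illustration; it is not part of the repo, and the normalization (lowercasing, dropping punctuation) is my own choice.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and strip punctuation so only word choice is compared."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

On the third example, `word_error_rate("Experience proves this", "experience proof this")` gives one substitution in three words, about 0.33, while the Google output matches the ground truth and scores 0.0.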

That’s it. Enjoy.