senselab.audio.tasks.speaker_diarization
Speaker diarization
Task Overview
Speaker diarization is the process of partitioning an audio recording into segments labeled by speaker, aiming to answer the question: "Who spoke when?"
Models
In senselab, we integrate pyannote.audio models for speaker diarization. These models can be explored on the Hugging Face Hub. We may integrate additional approaches for speaker diarization into the package in the future.
Evaluation
Metrics
The Diarization Error Rate (DER) is the standard metric for evaluating and comparing speaker diarization systems. It is defined as:
$$\text{DER} = \frac{\text{false alarm} + \text{missed detection} + \text{confusion}}{\text{total}}$$
where false alarm is the duration of non-speech incorrectly classified as speech, missed detection is the duration of speech incorrectly classified as non-speech, confusion is the duration of speaker confusion, and total is the sum over all speakers of their reference speech duration.
Note: DER takes overlapping speech into account. This can lead to increased missed detection rates if the speaker diarization system does not include an overlapping speech detection module.
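The formula above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only, not the senselab or pyannote implementation: the function name `diarization_error_rate` and the example durations are hypothetical, and the error durations are assumed to have been measured already (in seconds) against a reference annotation.

```python
def diarization_error_rate(
    false_alarm: float,
    missed_detection: float,
    confusion: float,
    total: float,
) -> float:
    """Return DER from error durations, all given in the same unit (e.g. seconds).

    Hypothetical helper for illustration; it only applies the DER formula.
    """
    if total <= 0:
        raise ValueError("total reference speech duration must be positive")
    return (false_alarm + missed_detection + confusion) / total


# Made-up durations: 50 s of reference speech summed over all speakers,
# with 1.5 s of false alarm, 3.0 s of missed detection, and 0.5 s of confusion.
der = diarization_error_rate(
    false_alarm=1.5,
    missed_detection=3.0,
    confusion=0.5,
    total=50.0,
)
print(f"DER = {der:.1%}")  # DER = 10.0%
```

Because false alarm is measured on non-speech regions while total only counts reference speech, DER is not bounded by 1: a system that produces a large amount of false alarm can score above 100%.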
Benchmark
You can find a benchmark of the latest pyannote.audio model's performance on various time-stamped speech datasets here.