senselab.audio.tasks.classification

Audio Classification

Task Overview

Audio classification is the broad category of tasks that process an audio input and assign it a label from a set of possible classes. The technology has a wide range of applications, including identifying speaker intent, classifying the spoken language, and even identifying animal species by their sounds.
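As a rough illustration of what "assigning a label from a set of possible classes" means in practice, the sketch below turns raw per-class scores into probabilities and picks the most likely labels. It is not senselab's implementation: the label set, the scores, and the helper names `softmax` and `top_labels` are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_labels(logits, labels, k=1):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical label set and scores for one audio clip (illustrative only).
EMOTION_LABELS = ["angry", "happy", "neutral", "sad"]
example_logits = [0.2, 2.1, 1.3, -0.5]

predictions = top_labels(example_logits, EMOTION_LABELS, k=2)
best_label, best_prob = predictions[0]
print(f"Predicted label: {best_label} (p={best_prob:.2f})")
```

The set of labels is fixed by how the model was trained, so the same clip passed to a model with a different label set would be assigned a label from that other set.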

Models

senselab supports a variety of models for audio classification tasks. They can be explored on the Hugging Face Hub. Models vary in performance, size, license, language support, and more. As with many models and tasks, performance may also depend on who is speaking in the processed audio clips (for example, there may be differences in age, dialect, or disfluencies). Review the model card for each model before use, and consult the most recent literature to make an informed decision. Unlike in other tasks, the exact classification task is determined by the dataset and class labels used to train the model, so even two models trained on the same dataset might classify audio into different categories (e.g., the LibriSpeech dataset could be used to classify age and/or gender based on the audio clips).

Some popular models for different classification tasks include:

Age

Gender

Music Genre

Adult vs. Child Speech

Emotion Recognition

Evaluation

Metrics

The primary evaluation metric for audio classification is accuracy. This can be unweighted (the number of correct labels divided by the total number of labels) or weighted (where each class is weighted according to the share of the whole dataset that the class label represents).

Other important metrics include:

Area Under the Curve (AUC)
Precision
Recall
F1-Score

More information about the different classification metrics can be found here, and a common function for summarizing the results of a classification can be found in scikit-learn.
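The metrics above can be sketched in plain Python as follows. This is a minimal illustration, not senselab's evaluation code: the helper names and the toy labels are hypothetical. The macro and support-weighted averages follow the conventions that scikit-learn's `classification_report` prints as "macro avg" and "weighted avg", which is assumed here to be a reasonable stand-in for the function mentioned above. AUC is omitted because it requires predicted scores rather than hard labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # how often each label truly occurs
    predicted = Counter(y_pred)    # how often each label was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for label in labels:
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support[label])
    return scores


def average_f1(scores, weighted=False):
    """Macro average, or average weighted by each class's share of the data."""
    if weighted:
        total = sum(s[3] for s in scores.values())
        return sum(s[2] * s[3] for s in scores.values()) / total
    return sum(s[2] for s in scores.values()) / len(scores)


# Toy, imbalanced example (hypothetical labels).
y_true = ["cat", "cat", "cat", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "cat"]

scores = per_class_scores(y_true, y_pred)
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", average_f1(scores))
print("weighted F1:", average_f1(scores, weighted=True))
```

On imbalanced data the two averages differ: the macro average treats every class equally, while the weighted average is dominated by the most frequent classes.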

Datasets

The following table lists datasets that are commonly used for evaluating audio classification models:

Dataset	Domain	Amount of Data (h)
AudioSet	Audio Events/Ontology	5.8 thousand
ICBHI Respiratory Sound Database	Medical/Respiratory	5.5
ESC-50	Environmental Audio	2.77
MSP Podcast	Speech Emotion Recognition	238

Note that this list of datasets is not exhaustive. If you are interested in benchmarking models in different languages or under specific conditions, consult the relevant literature.

View Source

""".. include:: ./doc.md"""  # noqa: D415

from .api import classify_audios  # noqa: F401