senselab.audio.tasks.features_extraction
Features extraction
Task Overview
This module provides the API for extracting voice and speech features from audio recordings using the senselab package. Features span multiple speech subsystems and clinical constructs (such as fluency, respiration, phonation, articulation, and spectral characteristics) and are derived using trusted libraries, including Praat-Parselmouth, OpenSMILE, Torchaudio, and Torchaudio-SQUIM.
The following table summarizes the currently supported features, categorized by speech subsystem or clinical construct, and includes a description, units, implementation reference, and implementation status.
Speech Subsystem / Clinical Construct | Feature | Description | Unit | Implementation | Implemented |
---|---|---|---|---|---|
Fluency | Duration | Total length of the audio recording | sec | Praat Parselmouth (docs) | ✅ |
Fluency | Phonation Time | Length of all phonated sounds within the audio | sec | N/A | No |
Fluency | Phonation Ratio | Phonation time divided by duration | -- | Praat Parselmouth (docs) | ✅ |
Fluency | Mean Phrase Duration | Average duration of a phrase (continuous speech between pauses) | sec | N/A | No |
Fluency | Coefficient of Variance Phrase Duration | Normalized variability of phrase durations | -- | N/A | No |
Fluency | Number of Spoken Units | Number of spoken units (phonemes, syllables, or words) identified in the audio | -- | N/A | No |
Fluency | Mean Unit Duration | Phonation Time divided by the number of spoken units | sec | N/A | No |
Fluency | Speaking Rate | Number of spoken units divided by duration | unit sec⁻¹ | Praat Parselmouth (docs) | ✅ |
Fluency | Articulation Rate | Number of spoken units divided by phonation time | unit sec⁻¹ | Praat Parselmouth (docs) | ✅ |
Fluency | Mean Length of Run | Average number of units produced in runs between silences | -- | N/A | No |
Fluency | Number of Pauses | Number of filled/silent pauses in a recording | -- | N/A | No |
Fluency | Pause Rate | Number of pauses divided by duration | unit sec⁻¹ | Praat Parselmouth (docs) | ✅ |
Fluency | Pause Ratio | Total pause time divided by audio recording duration | -- | N/A | No |
Fluency | Mean Pause Duration | Average duration of pauses (filled/silent) | sec | Praat Parselmouth (docs) | ✅ |
Fluency | Coefficient of Variance Pause Duration | Normalized variability in pause durations | -- | N/A | No |
Fluency | Mean Phone Length | Average duration of phones | sec | N/A | No |
Fluency | Phoneme-Dependent Duration | Linear combination of average phone durations | sec | N/A | No |
Fluency | Voice Onset Time (VOT) | Time between release of a stop consonant and the onset of vocal fold vibration | sec | N/A | No |
Fluency | Maximum Phonation Time | Maximum duration of a continuous phonation (usually a vowel) | sec | N/A | No |
Fluency | Pairwise Variability Index | Temporal variability between successive speech unit intervals | -- | N/A | No |
Respiration | Intensity | Sum of the squares of the signal amplitude (approximates loudness) | dB | Praat Parselmouth (docs) | ✅ |
Respiration | Intensity Range | Range of loudness values in a speech signal | dB | Praat Parselmouth (docs) | ✅ |
Respiration | Voice Range Profile (VRP) | Minimum and maximum intensity across a set of frequencies | dBHz⁻¹ | N/A | No |
Respiration | Number of Breath Events | Count of inhalations in a recording | -- | N/A | No |
Respiration | Speech Respiration Rate | Respiratory rate during speech | unit⁻¹ | N/A | No |
Respiration | Speech Tidal Volume | Amount of air inhaled during a typical breath for speech | mL | N/A | No |
Respiration | Pause Intervals per Respiration | Measure of breathing periodicity | -- | N/A | No |
Respiration | Relative Loudness of Respiration | Ratio of respiration loudness relative to speech intensity | -- | N/A | No |
Respiration | Respiratory Exchange Latency | Time interval between expiration and the subsequent inspiration | sec | N/A | No |
Phonation | Fundamental Frequency (F0) | Rate of vocal-fold vibration (perceived as pitch) | Hz | Praat Parselmouth (docs) | ✅ |
Phonation | Pitch Sigma | Standard deviation of F0, expressed in semitones | Semitones | N/A | No |
Phonation | Jitter (Absolute) | Average absolute difference between consecutive F0 periods | sec | Praat Parselmouth (docs) | ✅ |
Phonation | Jitter (Relative) | Absolute jitter divided by the average F0 period | % | Praat Parselmouth (docs) | ✅ |
Phonation | Shimmer (local) | Average absolute amplitude difference between consecutive F0 periods (relative measure) | % | Praat Parselmouth (docs) | ✅ |
Phonation | Shimmer (dB) | Difference in amplitude between consecutive F0 periods, expressed in dB | dB | Praat Parselmouth (docs) | ✅ |
Phonation | Harmonic to Noise Ratio | Ratio of harmonic energy to noise energy in voiced segments | dB | Praat Parselmouth (docs) | ✅ |
Phonation | Percentage of Unvoiced Frames | Fraction of pitch frames detected as unvoiced | % | N/A | No |
Phonation | Number of Voice Breaks | Count of interruptions in the fundamental period during sustained phonation | -- | N/A | No |
Phonation | Degree of Voice Breaks | Total duration of voice breaks relative to total signal duration | % | N/A | No |
Phonation | Hammarberg Index | Difference between the strongest energy peaks in two spectral ranges (0–2000 Hz and 2000–5000 Hz) | dB | N/A | No |
Phonation | Spectral Slope | Slope of the long-term average spectrum | dB/octave | Praat Parselmouth (docs) | ✅ |
Phonation | Spectral Tilt | Tilt of the regression line through the long-term average spectrum | -- | Praat Parselmouth (docs) | ✅ |
Phonation | Cepstral Peak Prominence | Integrative measure of temporal aperiodicity and spectral variation | dB | Praat Parselmouth (docs) | ✅ |
Phonation | H1–H2 | Difference between the levels of the first two harmonics | dB | N/A | No |
Phonation | H1\*–H2\* | Difference between the levels of the first two harmonics after removing formant influence | dB | N/A | No |
Phonation | Harmonic Richness Factor | Amplitude relationship between the fundamental and higher harmonics | dB | N/A | No |
Phonation | Parabolic Spectral Parameter | Quantifies the spectral decay of the voice source | -- | N/A | No |
Phonation | Open Quotient | Ratio of the open phase of the glottal pulse to the fundamental period | -- | N/A | No |
Phonation | Closing Quotient | Ratio of the glottal closing phase to the fundamental period | -- | N/A | No |
Phonation | Speed Quotient | Ratio between the durations of glottal opening and closing phases | -- | N/A | No |
Phonation | Normalized Amplitude Quotient | Ratio between the amplitude of the airflow and the peak flow derivative, normalized by period length | -- | N/A | No |
Articulation | Formant Frequencies | Center frequencies of vocal tract resonance peaks | Hz | Praat Parselmouth (docs) | ✅ |
Articulation | Formant Bandwidths | Width of the spectral peak (3 dB down from the resonance peak) | Hz | Praat Parselmouth (docs) | ✅ |
Articulation | Formant Slopes | Rate of change in formant frequencies over time | Hz/ms | N/A | No |
Articulation | Vocal Tract Coordination | Cross-correlation between formant trajectories at set time delays | -- | N/A | No |
Articulation | Vowel Space Area | Area of the quadrilateral defined by the four corner vowels in the F1–F2 space | -- | N/A | No |
Articulation | Formant Centralization Ratio (FCR) | Ratio combining F1 and F2 values of corner vowels (/a/, /u/, /i/) as defined in the literature | -- | N/A | No |
Articulation | Vowel Articulation Index (VAI) | Reciprocal of the Formant Centralization Ratio | -- | N/A | No |
Articulation | Goodness of Pronunciation | Posterior probabilities from an acoustic model reflecting pronunciation quality | -- | N/A | No |
Articulation | Wideband Perceptual Estimation of Speech Quality (PESQ) | Objective measure of speech quality based on perceptual modeling | -- | Torchaudio-SQUIM (docs) | ✅ |
Articulation | Short-Time Objective Intelligibility (STOI) | Predicts speech intelligibility by comparing short-time temporal envelopes of reference and degraded signals | -- | Torchaudio-SQUIM (docs) | ✅ |
Articulation | Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) | Signal fidelity measure that is invariant to signal scale | dB | Torchaudio-SQUIM (docs) | ✅ |
Articulation | Mean Opinion Score (MOS) | Estimate of the subjective audio-quality rating, predicted by a neural network model trained on human ratings | -- | Torchaudio-SQUIM (docs) | ✅ |
Spectral | Spectral Gravity | Spectral centroid (center of gravity) of the signal | Hz | Praat Parselmouth (docs) | ✅ |
Spectral | Spectral Deviation | Spread of spectral energy around the centroid (second moment) | Hz | Praat Parselmouth (docs) | ✅ |
Spectral | Spectral Skewness | Asymmetry of the spectral energy distribution (normalized third moment) | -- | Praat Parselmouth (docs) | ✅ |
Spectral | Spectral Kurtosis | Peakedness of the spectral energy distribution (normalized fourth moment) | -- | Praat Parselmouth (docs) | ✅ |
Spectral | Mel Frequency Cepstral Coefficients | Multivariate spectral representation based on the Mel frequency scale | -- | Torchaudio (docs) | ✅ |
Spectral | Linear Predictive Cepstral Coefficients | Cepstral coefficients derived through Linear Predictive Coding | -- | N/A | No |
Spectral | Perceptual Linear Prediction | Spectral representation based on the Bark scale with equal-loudness pre-emphasis | -- | N/A | No |
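
Several of the fluency and phonation rows above are simple ratios of other rows, so a short worked example may help make the definitions concrete. The sketch below is not senselab's API: the function names `fluency_summary` and `jitter`, their parameters, and the input values are all hypothetical, and the inputs (duration, phonation time, unit count, pause durations, F0 periods) are assumed to have been measured already by some other tool. It only applies the formulas given in the Description column, using the Python standard library.

```python
from statistics import mean


def fluency_summary(duration, phonation_time, n_units, pause_durations):
    """Apply the table's ratio definitions to pre-computed measurements.

    All arguments are hypothetical inputs: times in seconds, n_units a count
    of spoken units (e.g. syllables), pause_durations a list of seconds.
    """
    if duration <= 0 or phonation_time <= 0:
        raise ValueError("duration and phonation_time must be positive")
    return {
        # Phonation time divided by duration (dimensionless)
        "phonation_ratio": phonation_time / duration,
        # Number of spoken units divided by duration (unit/sec)
        "speaking_rate": n_units / duration,
        # Number of spoken units divided by phonation time (unit/sec)
        "articulation_rate": n_units / phonation_time,
        # Number of pauses divided by duration (unit/sec)
        "pause_rate": len(pause_durations) / duration,
        # Average pause duration (sec); 0.0 when no pauses were found
        "mean_pause_duration": mean(pause_durations) if pause_durations else 0.0,
    }


def jitter(periods):
    """Return (absolute jitter in sec, relative jitter in %) for F0 periods.

    Absolute jitter is the mean absolute difference between consecutive
    periods; relative jitter divides that by the mean period.
    """
    if len(periods) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    absolute = mean(diffs)
    relative = 100.0 * absolute / mean(periods)
    return absolute, relative


# Made-up measurements for a 10-second recording
summary = fluency_summary(
    duration=10.0, phonation_time=8.0, n_units=24, pause_durations=[0.5, 1.5]
)
jitter_abs, jitter_rel = jitter([0.010, 0.011, 0.010, 0.011])
```

With these made-up inputs, the speaking rate (24 units over 10 sec) is lower than the articulation rate (24 units over 8 sec of phonation), which is the distinction the two rows are meant to capture. Values produced by Praat-based implementations may differ, since they depend on how pauses, units, and periods are detected.
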
Beyond the descriptors listed above, users can extract additional acoustic representations such as:
Note: This section is under active development. Upcoming updates will address usability, efficiency, clarity, robustness, and overall effectiveness. We welcome any feedback: feel free to reach out via email at fabiocat@mit.edu or open an issue on GitHub.