Towards A Foundation Model for Clinical Voice Biomarkers

Key findings

Introduces VoiceFM, a contrastive clinical-voice foundation model that aligns a fine-tuned Whisper encoder with 44 clinical features on the Bridge2AI-Voice dataset (984 participants, 40,056 recordings, 5 medical centers). Linear probes on frozen embeddings reach a mean AUROC of 0.952 across five screening tasks, outperforming frozen Whisper and HuBERT. The model generalizes to held-out patients (AUROC 0.99 for Alzheimer's/dementia/MCI) and transfers without fine-tuning to Spanish-language Parkinson's detection and the mPower smartphone study.

Abstract

Source: publisher

Vocal biomarkers, encompassing voice and speech, have largely been developed for individual conditions in isolation, limiting their generalizability across diseases and recording settings. To address this, we introduce VoiceFM, a contrastive model that learns general-purpose clinical voice representations by aligning audio embeddings with rich clinical metadata. Using the Bridge2AI-Voice dataset (984 primarily English-speaking adult participants, of whom 846 were used for training and 138 were held out as a temporally separated validation cohort; 40,056 recordings totaling 176 hours across 5 academic medical centers), VoiceFM pairs a fine-tuned Whisper large-v2 encoder with a tabular transformer over 44 clinical features via a symmetric InfoNCE loss. Linear probes on frozen VoiceFM embeddings achieve a mean AUROC of 0.952 across five evaluation tasks, significantly outperforming Frozen Whisper, Frozen HuBERT, and the contrastively trained VoiceFM-HuBERT. On the 138-participant held-out cohort, VoiceFM-Whisper achieves AUROCs of 0.99 for Alzheimer's/dementia/MCI and 0.89 for airway stenosis. VoiceFM representations transfer to three external datasets without retraining and improve few-shot classification; without fine-tuning, the model reaches AUROC 0.93 on Spanish-language Parkinson's disease detection (NeuroVoz) and participant-level AUROC 0.87 on the mPower smartphone study. Together, these results show that contrastive alignment between voice and rich clinical metadata can serve as the basis for a clinical voice foundation model that generalizes across diseases, languages, recording conditions, and patients enrolled after model freeze.
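
For readers unfamiliar with the symmetric InfoNCE objective named in the abstract, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes each batch holds matched pairs (row i of the audio embeddings belongs with row i of the clinical embeddings) and treats every other row in the batch as a negative. The function names and the temperature value of 0.07 are illustrative assumptions; the abstract does not state the temperature or how the batches are built.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(audio_emb, clinical_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (audio, clinical) pairs.

    temperature=0.07 is an assumed illustrative value, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    c = [l2_normalize(v) for v in clinical_emb]
    n = len(a)

    # logits[i][j]: scaled cosine similarity of audio i and clinical record j.
    logits = [
        [sum(x * y for x, y in zip(a[i], c[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio -> clinical: each audio row should pick out its own clinical record.
    loss_a2c = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Clinical -> audio: same cross-entropy over the transposed logits.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_c2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_a2c + loss_c2a)


# Toy batch: two pairs in a 2-dimensional embedding space.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(audio, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce(audio, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each audio vector matches its own clinical vector and large when the pairs are swapped, which is the pressure that pulls the two encoders into a shared space. Averaging the two directions makes the loss invariant to which modality is treated as the query.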

Topics

speech-voice-biomarkers
ml-nlp-knowledge

Associated projects

Bridge2AI Voice
NIH OD OT2 OD032720 (MPI)

Lab authors

Satrajit S Ghosh