Discrimination in Artificial Intelligence for Voice Applications

AI History

The notion of Artificial Intelligence (AI) has been scrutinized throughout history. For instance, classical philosophers attempted to represent human knowledge and intelligence using symbolic abstractions. An early modern spark for the question of whether machines could think came from Vannevar Bush’s seminal article ‘As We May Think’ in 1945, in which he foresaw how the future could benefit from amplifying human knowledge using man-made machines. Subsequently, in 1950, Alan Turing proposed a logical framework to build and test intelligent machines, paving the way for John McCarthy to coin the term Artificial Intelligence at the first conference on the subject in 1956.

One rationale behind building AI systems is to mimic human intelligence and behavior, and the benefits that scientists anticipated from AI are now being realized. These benefits are manifested in the surge of AI applications serving a wide range of disciplines (e.g. healthcare, the legal system, and business). However, the ramifications of replicating human behavior and building anthropomorphic machines haven’t been fully questioned yet.

Is it ideal to build a computational abstraction simulating human intelligence?

Will AI systems inherit/mirror human prejudice?

Gender and Racial Disparities in AI

Many questions about racial and gender disparities have been raised recently, given the ample empirical evidence highlighting them in AI models. Discrepancies in the performance of deep learning (DL) models due to race and gender have been a trending topic. Despite the surge in the number of neural networks published to solve demanding tasks using diverse modalities (e.g. image, text, audio, and graphs), it has been noted that these models can be molded with racial and gender biases. For instance, previous studies pointed out discrepancies in facial biometric systems, showing that classifiers performed best on male subjects with lighter skin and worst on female subjects with darker skin. In the case of predictive algorithms, a software tool called COMPAS is widely used for risk assessment of criminal defendants. It was reported that this software favored white defendants over black defendants when predicting whether a defendant would reoffend: it generated significantly more false positives for black defendants than for white ones, as illustrated in Figure 1.


Fig.1 - COMPAS predictions reported in Dressel and Farid, 2018
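To make the idea of a group-wise false positive rate concrete, here is a minimal sketch of how it could be computed. The function name and the records are hypothetical and invented for illustration; they are not COMPAS data and do not reproduce the figures reported by Dressel and Farid.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """Return {group: FPR}, where FPR = FP / (FP + TN).

    Each record is (group, predicted_reoffend, actually_reoffended).
    """
    fp = defaultdict(int)  # predicted to reoffend, but did not
    tn = defaultdict(int)  # predicted not to reoffend, and did not
    for group, predicted, actual in records:
        if actual:
            continue  # FPR only considers people who did not reoffend
        if predicted:
            fp[group] += 1
        else:
            tn[group] += 1
    groups = set(fp) | set(tn)
    return {g: fp[g] / (fp[g] + tn[g]) for g in groups}

# Toy records, invented for illustration only (groups "A" and "B").
toy_records = [
    ("A", True, False), ("A", False, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", True, False), ("B", False, False), ("B", False, True),
]
fpr = false_positive_rate_by_group(toy_records)
```

In this toy example both groups contain the same number of people who did not reoffend, yet group "B" is wrongly flagged twice as often as group "A". That is the kind of gap a disparity in false positives refers to.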

Similarly, in healthcare, racial disparities have been reported that might have detrimental effects on diagnosis and access to treatment. For example, one study highlighted that a model rated a black patient as healthier than a white patient with the same condition.

It is therefore foreseeable that we will find gender and racial disparities in voice-based deep learning models as well. For instance, automatic speech recognition (ASR) systems have shown discrepancies across race and gender in word error rate (WER), i.e. the number of word substitutions, deletions, and insertions divided by the number of words in the reference transcript. WER results for race and gender are shown in Figures 2 and 3, respectively.


Fig.2 - WER reported in Koenecke et al., 2020


Fig.3 - WER reported in Tatman, 2017
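For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: a word-level edit distance divided by the length of the reference transcript. The function name and the example sentences are assumptions made for illustration, and the studies cited above may have applied additional text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transcripts: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Comparing the average of this quantity between speaker groups, on the same sentences, is what exposes the kind of gap shown in the figures above.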

These striking examples expose our limited ability to interpret neural networks and how they learn. One might guess that the main goal when developing a state-of-the-art DL model is only to improve accuracy, without considering the consequences. To better address these issues, we need to focus more on understanding the learning paradigm itself along with the datasets considered for training.

A recent study discussed racial and gender disparities in voice biometric systems, and we will dive deeper into it in this blog.

Racial and Gender Disparities in Voice Biometrics

Chen et al. created a matched corpus of non-speech voice recordings from 300 speakers. The speakers were balanced in terms of their racial profile (i.e. White, Black, Latinx, and Asian) and gender (i.e. Male and Female). First, they extracted inherent voice characteristics from the compiled dataset, encompassing temporal, spectral, cepstral, and entropy-related acoustic features. Then, they evaluated these 15 features across racial and gender subgroups, demonstrating significant differences across gender subgroups in most features, except for $\Delta$ MFCC, Perm entropy, and SVD entropy. Across racial subgroups, by contrast, only a small number of features showed significant differences (e.g. F0, F1, F2, PDF entropy, and Perm entropy). Their work pipeline is presented in Figure 4.


Fig.4 - Taken from Chen et al., 2022

In the second part of their study, they evaluated the speaker identification performance of different voice biometric models (e.g. MS Azure, 1D CNN, TDNN, ResNet18, ResNet34, and AutoSpeech). The authors found that the top three models (ResNet18, ResNet34, and AutoSpeech) yielded significant differences across racial and gender subgroups, with worse performance for Latinx subjects and for males in general, as shown in Figures 5 and 6. This led them to hypothesize that the main causal factor for gender disparities is inherent vocal characteristics, as illustrated above, in addition to the models’ bias, whereas the main causal factor for racial disparities is the features extracted by the DL models.


Fig.5 - Reported performance from Chen et al., 2022


Fig.6 - Reported performance from Chen et al., 2022

Gender Disparity

Chen et al. reported that the main causal factor for gender disparity is the variability in voice properties between men and women. This statement can be further supported by investigating the physiological differences between genders.

Physiological-based Differences

To speak, we first need to produce a sound with our vocal folds (VFs). Air from the lungs passes through the vocal folds and makes them oscillate, which creates a tone, as demonstrated in the lower part of Figure 7. This process is called phonation. The oscillation frequency is called the fundamental frequency (F0) and is perceived as pitch. The pitch of phonation changes in response to changes in the VFs’ elasticity and tension.


Fig.7 - MRI scan of someone speaking (source)

Projecting this information onto gender, adult males have, on average, longer and thicker VFs than adult females, yielding slower VF vibration (lower F0) that is perceived as low-pitched. On the other hand, adult female VFs are relatively shorter and thinner, producing faster VF vibration (higher F0) that is perceived as high-pitched.
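As a rough illustration of what measuring F0 means in practice, here is a minimal sketch that estimates it from a waveform using autocorrelation. Everything in it is an assumption made for the example: the function names, the 8 kHz sample rate, and the two synthetic tones at 120 Hz and 210 Hz, which are arbitrary stand-ins for a lower and a higher voice and not measurements from any study. Real voices are far less clean than a pure sine wave, and dedicated pitch trackers are more robust.

```python
import math

def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 in Hz as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Un-normalised autocorrelation: longer lags sum fewer terms, which
        # helps avoid picking a multiple of the true period.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def synth_tone(freq_hz, sample_rate=8000, duration_s=0.05):
    """Pure sine wave standing in for a sustained phonation (illustrative)."""
    n_samples = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

SAMPLE_RATE = 8000
low_f0 = estimate_f0(synth_tone(120.0), SAMPLE_RATE)   # slower vibration
high_f0 = estimate_f0(synth_tone(210.0), SAMPLE_RATE)  # faster vibration
```

Because the lag is a whole number of samples, the estimates are only accurate to within a few hertz, which is enough to separate the two tones.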

Similarly, adult males typically have longer vocal tracts than females. These physiological differences translate into sex-dependent differences in the extracted acoustic features. It is worth mentioning that the acoustic signals generated are highly dependent on the geometry of the vocal tract, the oral and nasal cavities, and the VFs. However, several studies have shown that other physiological data (e.g. height and weight) have a weak correlation with the perceptual outcome (Hollien and Jackson, 1973; Kunzel, 1989; Van Dommelen and Moxness, 1995; Collins, 2000; Gonzales, 2004).

Cultural-based Differences

Based on these studies, researchers started to consider other influences beyond physiological features that impact perceptual features. Indeed, they found cultural influences on speaking pitch levels (Deutsch et al., 2009). This means that long-term exposure to a specific community might create a mental representation that influences a speaker’s pitch level, suggesting that cultural factors have as much impact on perceptual output as physiological factors, or possibly more. In a similar vein, studying the prosodic features of gender-neutral interactional particles in Japanese such as ne and yo, Hiramoto-Sanders (2002) found that more prosody is expressed in feminine speech than in masculine speech.

Thus, the differences in inherent vocal characteristics reported in the paper might mirror socio-cultural and physiological variability between genders.

Model-based Differences

As shown in Figure 6, four out of six models featured gender disparities. MS Azure performed significantly better on male subjects than on female subjects. On the other hand, the top three models performed worst on male subjects. Although the inputs to the six models were consistent, the performances across gender were inconsistent. As a result, Chen et al. suggested that the models’ extracted features contributed to the observed performance discrepancy. We will further unpack the model-based impact in the section Dissecting AI Features.

Racial Disparity

Chen et al. indicated that the voice biometric models they studied performed worst on Latinx speakers. As a result, they suggested that the causal factor for racial disparities is predominantly in the features extracted from the DL models.

Compared with gender-dependent voice studies, studying racial groups is considerably more complex: many studies have attempted to tackle the topic, but with inconclusive results. The main question here is whether the distinguishability of a speaker’s race is due to inherent physiological differences or acquired dialectal differences. As Kreiman and Sidtis (2011) mentioned, if it is the former, then racial identity is consistently molded into a speaker’s speech, whereas if it is the latter, then racial identity might vary across speakers depending on their dialectal exposure.

Physiological-based Differences

There is contradictory evidence regarding physiological differences across races. For instance, Xue et al. (2009) evaluated the vocal tract morphological differences of adult male speakers from three different racial populations (White American, African American, and Chinese). Their findings suggest that there is a significant difference in total vocal tract volume and oral volume across the three racial groups. Also, Chinese subjects featured different overall vocal tract configurations relative to their White and African American cohorts. It is worth mentioning that the subjects were recorded uttering the neutral /ɑ/ sound to alleviate the effect of language or dialect. Furthermore, Walton et al. (1994) reported that human accuracy in identifying speaker race from a sustained /ɑ/ vowel uttered by 50 white and 50 black males is above chance, and that black speakers had higher amplitude and frequency perturbations relative to their counterparts, in addition to a significantly lower harmonics-to-noise ratio.

Conversely, Ajmani (1990) reported no significant differences in laryngeal sizes between adult Nigerians of European and African descent. Also, Xue and Fucci (2000) observed no significant acoustic differences between 44 Caucasian American elderly speakers and 40 African American elderly speakers, suggesting that most acoustic parameters are highly sex-dependent.

Building on the previous studies, it is safe to say that there is considerable uncertainty about the significance of vocal variability across racial groups, which highlights the lack of studies considering the interplay between physiological, acoustic, and perceptual components. Further probing, with appropriately designed experiments, should be encouraged to elucidate whether racial differences in voice are physiologically ingrained or learned through dialect. The answer might also affect our understanding of treatment approaches for different racial groups: Radowsky et al. observed disparities across races after thyroid and parathyroid surgeries, yet without a clear causal explanation.

Surprisingly, recent work by a group at MIT showed that AI models were capable of predicting self-reported race from different imaging modalities (X-ray imaging, chest CT imaging, and mammography) with high performance, suggesting that these models captured race-related physiological features.

That being said, are any disparities or similarities in inherent vocal characteristics across racial groups, as reported by Chen et al., due to dialectal differences, or might variations in physiology across speakers play a role here?

Cultural-based Differences

As briefly mentioned with gender disparity, culture may play a role in vocal differences. Deutsch et al. examined females’ pitch levels in two Chinese villages, showing that they significantly differed by approximately three semitones although they shared the same race, language, and dialect. This experiment suggested that F0 production could reflect a learned representation acquired through linguistic exposure in a community.
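To give a sense of scale for a difference of about three semitones, the sketch below converts a pair of F0 values into a semitone interval using the standard relation of 12 semitones per doubling of frequency. The function name and the two frequencies are hypothetical values chosen for illustration; they are not the measurements reported by Deutsch et al.

```python
import math

def semitone_interval(f0_a_hz, f0_b_hz):
    """Interval from f0_a to f0_b in semitones (12 semitones per octave)."""
    return 12.0 * math.log2(f0_b_hz / f0_a_hz)

# Hypothetical average pitch levels for two groups of speakers, in Hz.
group_a_f0 = 200.0
group_b_f0 = 237.84
interval = semitone_interval(group_a_f0, group_b_f0)  # roughly 3 semitones
```

In other words, a three-semitone gap corresponds to one group speaking at an F0 roughly 19% higher than the other.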

Model-based Differences

As stated at the beginning of this section, Chen et al. suggested that the features extracted by the models are the predominant factor behind the observed disparities. As shown in Figure 5, five out of six models performed worst on Latinx subjects, with different levels of significance, emphasizing the inconsistencies in the models’ performances.

Chen et al. argue that Latinx speakers feature lower F1 values in comparison to White speakers (Xue et al., 2006), suggesting that the DL models are not capable of identifying the F1 band in Latinx speakers, hence the degraded performance. However, when measuring the F1 values on the matched dataset, one can notice the large overlap in the F1 distributions between White and Latinx speakers, as shown in Figure 8, which might contradict the aforementioned assumption.

With this in mind, the observed performance could then be explained by saying that the DL models amplified the slight differences in acoustic features, leading to obvious disparities. Nevertheless, the explanation for racial disparities reported in the paper might need more unpacking to further understand the models’ behavior.

Why would models feature disparities with varying levels of significance?


Fig.8 - Reported F0 and F1 values across races (W: White, B: Black, A: Asian, L: Latinx) from Chen et al., 2022

Dissecting AI Features

During training, DL models learn to optimize for a specific task or loss function by transforming input features into task-related labels. This process is carried out by feeding samples from a dataset to the model. The architecture of the DL model determines its complexity and the type of information (e.g. sequences, patterns, local/global features, etc.) learned by the model. That being said, the performance of any DL model depends largely on:

i) the dataset fed to the model; ii) the format of the input; iii) the model’s architecture; and iv) the optimization function.

To further sift through the reasons behind the models’ biases and disparities, it is important to investigate the contribution of each of the mentioned factors to the final outcome. In the next section, we shed some light on the impact of datasets used for evaluation and pre-training on models’ performance.

Non-speech Voice Snippets and Inclusive Datasets

Chen et al. proposed a matched dataset including speakers from both genders and four different races. Additionally, they compiled non-speech voice snippets from the mPower dataset. The rationale behind collecting only non-speech snippets was to alleviate linguistic and accent effects on the voice. This approach has been endorsed for that reason and for security reasons as well, since such recordings are less identifiable (Poddar et al., 2017). Accordingly, the authors selected voice snippets of speakers uttering the /ɑ/ vowel for 10 seconds, reasoning that this vowel occurs more often than other syllables. Even so, they found visible disparities in the performance of the models, as discussed. As mentioned before, they pointed out that Latinx speakers feature lower F1 during /ɑ/ phonation compared to White/Caucasian speakers (Xue et al., 2006), implying that the discrepancy in performance affecting Latinx speakers might be due to a technical gap in the models’ feature extractors.

Before attributing the disparities to the models’ biases, it is crucial to question the format of the input fed to the model. For instance, the authors focused on /ɑ/ phonation as a way of benchmarking the performance of DL models. However, it is intuitive that Latinx speakers would feature lower F1 given that the "ae" sound doesn’t exist in the Spanish language. Consequently, the phonation of /ɑ/ would be significantly different between Latinx and White speakers, which raises a question: how can we compare F1 or other acoustic features across races when the groups have different vowel inventories?

We might be using an already biased dataset for evaluation since the selected vowel is not neutral across different races.

One might argue that the reported performances could be phoneme-dependent. Would the models maintain the same performance discrepancy if the speakers uttered different phonemes? Or would the models disadvantage other racial groups instead?

However, the reported results might imply that the DL models were not exposed to different vowel inventories, leading us to explore the datasets used for pre-training the models. In trying to make current DL models less biased, most of the studies mentioned suggest building more inclusive datasets to help the model learn more generalized features independent of race and gender (Tatman, 2017; Koenecke et al., 2020; Chen et al., 2022). Indeed, aiming for a well-designed dataset might help dampen the disparity issue. Nevertheless, one might argue that it is very challenging to do so. As a matter of fact, it could be infeasible to collect data covering different races, cultures, languages, dialects, etc., and we have seen how people who share the same race and language could still have significant differences in pitch levels (Deutsch et al., 2009). And what about people who belong to mixed-race families? Would they share physiological/acoustic features similar to those of a specific race?

Additionally, we have seen the impact of languages on the uttered vowels, as demonstrated in the case of Latinx speakers uttering /ɑ/. This could open the door for more questions, for example: would Latinx speakers who learned English as a first language show acoustic features (e.g. F1) similar to those of Latinx speakers who acquired English later in life?

Even if we manage to account for between-speaker, between-gender, and between-race variability, how can we reduce within-speaker variability (Lavan et al., 2019a; Lavan et al., 2019b)? Would the model’s performance vary on the same speaker given a minimal change in the way the speaker utters a vowel?

To answer these questions, we need to probe more thoroughly whether speakers within a racial group share physiological, acoustic, or perceptual features. Then, we need to study how to factor out the cultural influence within groups before attempting to treat the problem as one of discrete racial categories. To the best of our knowledge, our voices are more of a continuous spectrum, and we would always have a margin of error and discrepancy even if we compiled an inclusive dataset. Kathiresan (2022) created gender-balanced training and testing sets from two datasets in different languages (English and Mandarin). Then, he extracted i- and x-vector speaker embeddings to test them in a speaker recognition task. Interestingly, there were still obvious disparities between genders although data samples were balanced during training and testing. This finding supports the view that datasets are not the only factor behind biased results.

Architecture-based Effects

Besides focusing on datasets, we might need to explore different architectures for speaker identification models. For instance, most of the current models are based on convolutional neural networks (CNNs). CNNs are known for extracting fine-grained and local features from the input data. However, CNNs were traditionally used on images to generate representations covering their spatial components. Over time, CNNs were adapted to work on voice signals by using 2D spectrograms as an equivalent of a 2D image fed to the architecture (Dieleman et al., 2011). A spectrogram has a temporal dimension that qualifies it as sequential data (Palanisamy et al., 2020); hence, some efforts have been made to modify the kernel size in CNNs to go in one direction and account for this temporal sequence (Chen et al., 2019). Others complemented the CNN with modules that capture sequential information, such as recurrent neural networks (RNNs) (Phan et al., 2017). RNNs are capable of capturing temporal dependencies in sequential data by featuring an internal memory component. Furthermore, others used transformers to extract global and contextual features from audio (Gulati et al., 2020).
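To illustrate what a kernel that goes in one direction could look like, here is a minimal sketch of a convolution that spans every frequency bin of a spectrogram and slides only along the time axis. The function name, the toy spectrogram, and the all-ones kernel are assumptions made for the example; in a real model the kernel weights are learned, and this is not the specific design used in the cited works.

```python
def conv1d_over_time(spectrogram, kernel):
    """Slide a kernel along the time axis of a time-major spectrogram.

    spectrogram: list of frames, each a list of n_freq magnitudes.
    kernel: list of k rows of n_freq weights; it covers all frequency bins,
            so it moves in one direction only (time).
    Returns one value per valid time position. As in most DL frameworks,
    this is technically a cross-correlation (the kernel is not flipped).
    """
    k = len(kernel)
    outputs = []
    for t in range(len(spectrogram) - k + 1):
        total = 0.0
        for dt in range(k):
            frame = spectrogram[t + dt]
            weights = kernel[dt]
            total += sum(x * w for x, w in zip(frame, weights))
        outputs.append(total)
    return outputs

# Toy spectrogram: 5 time frames x 3 frequency bins (invented values).
toy_spectrogram = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
# Hypothetical kernel covering 2 frames and all 3 bins, with fixed weights.
toy_kernel = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
time_features = conv1d_over_time(toy_spectrogram, toy_kernel)
```

A 2D image-style kernel would instead cover only a small patch of frequency bins and slide along both axes, which treats time and frequency as interchangeable spatial dimensions.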

The level of model complexity and the information learned could vary widely as we continue to augment and modify the architectural bases of feature extractors. Thus, it will be advantageous to break down the problem of model-based discrepancies into architecture-based differences and dataset-based differences. More ablation studies and experiments on such models might boost our understanding of how each facet of a model contributes to the outcome.

What’s Next?

Since racial and gender disparities were reported, several documentaries, blogs, and interviews have chewed over this topic from different perspectives (e.g. political, technical, and social). While they all advocate creating more inclusive datasets, we have argued here that changing the dataset addresses only one aspect of a bigger problem and might not, on its own, adequately mitigate these disparities. We suggest elaborating on all factors responsible for such outcomes, including datasets, and further pinpointing their contributions. Additionally, understanding the nature of variability in gender and race is crucial to better design experiments that account for such differences.

Finally, acknowledging the limitations of our current models might help alleviate the fear of echoing human prejudice in models. Humans are biased by nature. Thus, if the ultimate goal is to build anthropomorphic (human-like) models, having biased models is an inevitable fate. It can be argued, however, that we are technically building human-inspired models rather than human-like ones, since human-inspired models only emulate features of human behavior. Examples of this include CNN layers that learn hierarchical information, self-supervised learning that reinforces common-sense understanding in models, and the concept behind learnable neural networks. That being said, it is a matter of time and effort to eventually disentangle some of the misapprehensions and ramifications concerning AI.