As you prepare your next asynchronous lecture, you set up your laptop on an appropriately sized delivery service box for the best camera angle, turn on the ring light, click start on the extension that embeds your webcam, and press record. You did your best to incorporate engagement, representation, and action and expression into your lecture. You used closed captioning for students who cannot hear or listen to the audio. That is, you incorporated the principles of universal design for learning. Your lectures should be accessible to everyone.
Imagine that you are beginning a unit on career development. You chose to include a YouTube video of Steve Harvey using his personal life story as a mechanism to encourage Black Americans to take risks in their own lives. You included this video precisely because of Harvey’s use of his personal narrative as an attempt to connect to the diversity within your classroom. To your surprise, the automatic video transcription (AVT) is—quite frankly—a hot mess. Approximately one minute into the video, Harvey switched from Mainstream American English (MAE) to African American English (AAE) as he narrated his story. He said, “Now, this here. This a gold star moment right here.” AVT transcribed this as, “It is here this is gonna stop bombing right here.”
Unfortunately, these types of transcription errors in automated speech recognition are far from rare. They do not affect only supplemental materials like embedded videos. They also affect what is transcribed when instructors or students who speak non-MAE dialects use AVT systems to capture their own words. We suggest that these types of transcription errors represent inherent inequities in the AVT systems on which we have come to rely during the COVID-19 pandemic. The presence of these errors draws negative attention to subtle differences in oral language that relate to the use of nonmainstream dialects, English language learner status, or even communication difficulties. As a result, they have the potential to affect what and how students learn.
Why do these gaps exist?
There are 24–30 different dialects of American English (Joseph & Janda, 2003). The dialects are associated with different income levels, geographic regions, racial and ethnic groups, or combinations of these factors. Some dialects, such as New York City English, do not differ greatly from MAE. Others, such as AAE, diverge widely regarding speech sound production, grammar, and vocabulary. For example, one speech sound production rule of AAE affects production of word-final consonant clusters (e.g., the “ft” in “left”). In AAE, the final consonant of the cluster (i.e., “t”) is not produced when the following word begins with a consonant (e.g., “left hand” becomes “lef hand”). Grammar rules in AAE differ from MAE, especially concerning use of forms of the verb “be,” “Wh-” questions, and negative statements. Some examples: “He big”/“He is big”; “He be coming around all the time”/“He always comes around”; “What you did that for?”/“Why did you do that?”; and “She don’t like no vegetables”/“She doesn’t like any vegetables.” AAE vocabulary tends to be more dynamic and flexible than MAE vocabulary. As a vocabulary item from AAE is acculturated into MAE, its use in AAE tends to rapidly diminish (e.g., the rise and demise of “bling” in recent years). The AAE meaning attributed to a word may also cause temporary ambiguity in comprehension if not supported by enough context.
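The consonant-cluster rule described above is regular enough to sketch in code, which underscores that it is a rule-governed system rather than random variation. The toy Python function below is our own illustration (it operates on spelling rather than actual speech sounds, and ignores many phonological details): it drops the final consonant of a word-ending cluster when the next word begins with a consonant.

```python
def reduce_final_cluster(phrase: str) -> str:
    """Toy sketch of the AAE word-final consonant-cluster rule:
    drop the last consonant of a word-ending cluster when the next
    word begins with a consonant (e.g., "left hand" -> "lef hand").
    Operates on letters, not sounds, so it is only an illustration.
    """
    consonants = set("bcdfghjklmnpqrstvwxz")
    words = phrase.lower().split()
    out = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # word ends in a two-consonant cluster AND the next word
        # starts with a consonant -> reduce the cluster
        if (len(w) >= 2 and w[-1] in consonants and w[-2] in consonants
                and nxt and nxt[0] in consonants):
            out.append(w[:-1])
        else:
            out.append(w)
    return " ".join(out)

print(reduce_final_cluster("left hand"))  # lef hand
print(reduce_final_cluster("left arm"))   # left arm (next word starts with a vowel)
```

Note that the rule is context-sensitive: the same word “left” surfaces differently depending on what follows it, which is exactly the kind of systematic variation a transcription model trained only on MAE will misread.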
All dialects of English are fully formed linguistic systems that are not and should not be considered substandard variants or vernaculars of MAE. They simply differ in form and content from MAE just as MAE differs from British English. But the incidence and prevalence rates of non-MAE dialect use in college classrooms are unknown because these demographic characteristics are not currently captured in national reporting databases.
We argue that the difficulty systems like AVT have with non-MAE dialects stems from an implicit bias against these dialects. This implicit bias begins early, seeps into general society, and affects education across all levels. Elementary teachers generally have a more negative view of students who use non-MAE dialects (Diehm & Hendricks, 2020). Yet little to no support is provided for non-MAE-speaking children to become bidialectal in the way support is provided for children who are English-language learners. Researchers and companies have developed programs to record “teacher talk” in classrooms and schools. In one case, however, a developer had to use adult language samples from the Northeast US to train the machine algorithm because adult samples from the Southeast US could not be accurately recognized and transcribed. When it comes to voice assistants, smart speakers, and alternative and augmentative communication devices, it is possible to choose MAE, British, Australian, or even Spanish-influenced English options. But there are no options to choose AAE or other non-MAE dialects. To put it another way, non-MAE speakers currently are expected to conform to a set of MAE standards to access these increasingly ubiquitous automated systems. It is clear that change is needed to make space for all dialects.
Where do we go from here?
Speech recognition systems likely will be a large part of both online and face-to-face instruction as we emerge from the confines of the pandemic. We should not remain mired in our usual ways of doing things, which are constrained by social biases and limited experiences. There is ample room to build more inclusive environments and elevate all voices in the classroom.
Companies such as Google and Amazon acknowledge the difficulty systems like AVT have in recognizing non-MAE speech. These systems are developed by and designed for MAE speakers. Errors occur because the systems attempt to match each spoken word to the database entry it most closely resembles. Consequently, they cannot negotiate the group and individual variations inherent in different dialects because the databases do not contain enough tokens of non-MAE language (Biadsy, 2011). Automated speech recognition systems fail to accurately recognize an average of 35 percent of the speech of non-MAE speakers (Tatman & Kasten, 2017). One attempt to improve the underlying algorithm reduced the error rate for non-MAE dialects relative to MAE to 14.6 percent (Biadsy, 2011); for comparison, the error rate for Indian English relative to MAE was 6.3 percent. The tech companies offer some options to build custom vocabularies to help with spoken language recognition; however, these services can be cost prohibitive and are generally limited to a single product or platform. Clearly, considerable work is required to improve automated systems’ ability to accurately recognize and transcribe the speech of non-MAE speakers in real time.
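Error rates like those cited above are typically reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system’s output into the reference transcript, divided by the length of the reference. The sketch below (a standard edit-distance implementation we wrote for illustration, not the exact pipeline any cited study used) applies it to the Steve Harvey example from the opening anecdote.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, computed with the
    standard edit-distance dynamic program over word sequences.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Harvey's actual words vs. the AVT output quoted earlier
ref = "now this here this a gold star moment right here"
hyp = "it is here this is gonna stop bombing right here"
print(word_error_rate(ref, hyp))  # 0.6 — six of ten words wrong
```

A WER of 0.6 on a single short utterance is far worse than the average rates cited above, but it illustrates how quickly a dialect shift the system was never trained on can degrade a transcript.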
What can instructors do to help resolve these inequities in the classroom?
To paraphrase Steve Harvey, creating space for all American English dialects in the classroom really would be a gold-star moment.
Biadsy, F. (2011). Automatic dialect and accent recognition [Doctoral dissertation, Columbia University]. Academic Commons. https://academiccommons.columbia.edu/doi/10.7916/D8M61S68
Diehm, E. A., & Hendricks, A. E. (2020). Teachers’ content knowledge and pedagogical beliefs regarding use of African American English. Language, Speech, and Hearing Services in Schools, 52(1), 100–117. https://doi.org/10.1044/2020_LSHSS-19-00101
Joseph, B. D., & Janda, R. D. (Eds.). (2003). The handbook of historical linguistics. Blackwell Publishing. https://doi.org/10.1002/9780470756393
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14), 7684–7689. https://doi.org/10.1073/pnas.1915768117
Tatman, R., & Kasten, C. (2017). Effects of talker dialect, gender & race on accuracy of Bing Speech and YouTube automatic captions. Interspeech 2017, 934–938. https://doi.org/10.21437/Interspeech.2017-1746
Lori A. Bass, PhD, CCC-SLP, is an assistant professor at Worcester State University. She earned her PhD in communication sciences and disorders from Florida State University. Her areas of scholarship include supporting the needs of students at risk for poor academic outcomes as a result of cultural and linguistic diversity.
Rihana S. Mason, PhD, is a research scientist at the Urban Child Study Center (UCSC) at Georgia State University. She earned her PhD in experimental psychology from the University of South Carolina. Her areas of scholarship include vocabulary development in diverse populations. She also evaluates diversity, equity, and inclusion pipeline programming.