This post explores the intricacies of audio-to-face tech, visemes, and speech articulation – written from the perspective of a facial motion expert working on AI lip sync technologies.
From Audio to Face: The Struggle Is Real
No single mouth shape defines a particular speech sound. Our articulation changes with every phoneme, syllable, and word we utter. Though we try to simplify lip sync studies by assigning canonical shapes (or visemes) to groups of phonemes, in an absolute sense, the “right” shape does not exist. The “right” shape is always relative and depends greatly on a slew of variable conditions beyond the simple problem of coarticulation (see NOTE 1). For those working in facial animation or on audio-to-face technologies, this reality is one of the biggest challenges in creating accurate and natural-looking speech.
NOTE 1: Though coarticulation is not necessarily simple, it begins to feel simple once you’ve been exposed to the realities of other complex and unpredictable conditions that affect articulation.
Gemma’s Gritted-Teeth Delivery In S02E07
A perfect illustration of the viseme problem can be observed in Severance Season 2, Episode 7, when the character known as Gemma asks her antagonist:
“Can you please just talk like a normal person?”
In American English, this line can be phonetically transcribed as:
kən ju pliz ʤʌst tɔk laɪk ə ˈnɔrməl ˈpɜrsən?
The actor playing Gemma, Dichen Lachman, delivers her line through gritted teeth with an extremely muted articulation style. (To see her more animated baseline, refer to her speech before the 47-second mark.) We feel her pain, anger, and frustration held back by her clenched jaw and minimally moving lips. A beautiful performance for viewers – but a troubling real-world example for audio-to-face researchers.
From emotions to volume and speed, the factors affecting mouth shapes in speech are seemingly never-ending. Though Gemma’s gritted teeth and muted lips do not affect the auditory legibility of her delivery, a lip-reader would be hard-pressed to decode her words. Her articulation style severely alters the expected look of many vowels and consonants.
Phonemes & Visemes: A Closer Look
Above is a stabilized clip of Gemma’s speech highlighting her most contrastive lip shapes (see NOTE 2). Observe how, despite being the most contrastive, many of these shapes are indiscernible and fail to fulfill the expected features of their associated visemes.
NOTE 2: Not all of the phonemes in Gemma’s speech are captured here. Many were left out because they were visually indistinguishable from surrounding sounds.
Below is the same set of phonemes from the clip above, shown as still images of their visual counterparts. Hover over each photo to view the graphemic context for each viseme.
/p/'s /b/'s & /m/'s As Anchor Points
When assessing both the clip and still images from Lachman’s performance, it is evident that the tried and true closed-lip bilabials – /p/, /b/, and /m/ – are still closing like they’re supposed to. You can also see a slight increase in lip corner width for /i/ as well as a commendable nearly-closed rounding for /u/.
In general, while the gap between a phoneme’s expected viseme form and the form it actually takes is extremely variable, some phoneme groups are fussier than others and command a more rigid arrangement of articulator positions. If you’re in facial animation or audio-to-face research, you are likely already familiar with the sturdiness of /p/, /b/, and /m/. These three are typically grouped into the same viseme category, a closed lip shape, and they make great anchor points when assessing the quality and accuracy of simulated speech. We love them because they always close, right? …Right?
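The anchor-point idea can be sketched as a simple check. This is a minimal sketch, assuming a hypothetical per-frame lip aperture signal (normalized so 0.0 means lips touching and 1.0 means fully open) and hypothetical phoneme timings. The `closed_lips` label, the `PhonemeSpan` type, the `bilabial_closure_rate` helper, and the threshold are all illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

# Hypothetical viseme table. Only the closed-lip group named in the text is
# filled in; the label "closed_lips" is an illustrative assumption.
PHONEME_TO_VISEME = {"p": "closed_lips", "b": "closed_lips", "m": "closed_lips"}


@dataclass
class PhonemeSpan:
    phoneme: str      # phoneme symbol, e.g. "p"
    start_frame: int  # first frame of the phoneme (inclusive)
    end_frame: int    # end of the phoneme (exclusive)


def bilabial_closure_rate(spans, lip_aperture, closed_threshold=0.05):
    """Return the fraction of /p/, /b/, /m/ spans whose lips reach closure.

    lip_aperture is a list with one value per frame, assumed normalized so
    that 0.0 means lips touching and 1.0 means fully open. Returns None when
    the utterance contains no usable closed-lip phonemes.
    """
    closed = 0
    total = 0
    for span in spans:
        if PHONEME_TO_VISEME.get(span.phoneme) != "closed_lips":
            continue
        frames = lip_aperture[span.start_frame:span.end_frame]
        if not frames:
            continue  # skip spans that fall outside the tracked frames
        total += 1
        if min(frames) <= closed_threshold:
            closed += 1
    return closed / total if total else None


# Made-up data: /p/ reaches closure, /m/ does not.
spans = [PhonemeSpan("p", 0, 3), PhonemeSpan("i", 3, 6), PhonemeSpan("m", 6, 9)]
aperture = [0.30, 0.02, 0.20, 0.40, 0.45, 0.40, 0.25, 0.15, 0.30]
print(bilabial_closure_rate(spans, aperture))  # 0.5
```

A rate well below 1.0 is a prompt to look closer at the footage, not proof that the output is wrong.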
/p/'s /b/'s & /m/'s: The Hard Truth
Unfortunately, though closed-lip bilabials can be great anchor points, even the most robust phonemes are not immune to variation. Pop open a Mr. Beast video (or should I say, Nr. Veast) and watch your world crumble as the lips of the Veast fail to close for a large portion of /p/’s, /b/’s, and /m/’s. Ɱr. Veast is an avid labiodentalizer. (Read more about labiodentalization here and here.)
You may be tempted to argue that if the lips do not close, the sound does not count as a /p/, /b/, or /m/; however, the not-fully-closed lip situation does not strip p’s, b’s, or m’s of their phonemic status, and it does not stop us from perceiving them as p’s, b’s, or m’s. In fact, these not-fully-closed versions are just common allophones of /p/, /b/, and /m/.
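One way to carry this into tooling, sketched under assumptions: treat each phoneme as a set of acceptable lip realizations rather than a single canonical shape. The realization labels (`bilabial_closure`, `labiodental_contact`), the table, and the `is_plausible_realization` helper are hypothetical, chosen only to mirror the /p/, /b/, /m/ case described above.

```python
# Hypothetical table: each phoneme maps to the lip realizations we accept.
# The labels are illustrative; a real system would define its own inventory.
ACCEPTED_REALIZATIONS = {
    "p": {"bilabial_closure", "labiodental_contact"},
    "b": {"bilabial_closure", "labiodental_contact"},
    "m": {"bilabial_closure", "labiodental_contact"},
}


def is_plausible_realization(phoneme, observed):
    """Return True if the observed lip realization is an accepted allophone.

    Phonemes missing from the table are treated as unconstrained, so they
    always pass.
    """
    accepted = ACCEPTED_REALIZATIONS.get(phoneme)
    return accepted is None or observed in accepted


print(is_plausible_realization("m", "labiodental_contact"))  # True
print(is_plausible_realization("b", "open_lips"))            # False
```

Under this framing, a labiodental /m/ passes as a valid realization instead of being flagged as a failed closure.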
Read the following breakdown from Wikipedia:
Stage 5: Acceptance
Though the most reliable visemes are not as reliable as widely believed, all hope is not lost. Once you learn to face the FACS, I mean – facts, and embrace the chaos of human behavior and mechanics, deciphering our cloud-like complexities can be exciting. Let’s close out with a poignant snippet from neuroscientist, primatologist, and goated lecturer, Robert Sapolsky:
More Lip Sync Resources
For more rigid and clock-like examples of visemes, check out my:
- Human speech variability (beyond the IPA charts)
- Linguistic foundations
- The anatomy of articulation (jaw, lips, tongue, teeth)
- Coarticulation and edge cases
- Why “canonical” visemes don’t work
- How to design modular speech systems
- Speech vs. emotion (how they can hinder or harmonize)
- Visemes, FACS, and flexible blendshape formulas