For nearly four years, I worked on augmented reality (AR) and virtual reality (VR) technologies, but I could never understand the drive to develop products to replace in-person activities. Virtual meetings, virtual collaborations, virtual hangouts – they all seemed silly. Why should we push simulated interaction when we have the luxury of face-to-face connection?
With the current state of our world, we no longer have the luxury of face-to-face connection, and we don’t know how long we will be confined to this restricted lifestyle. We also don’t know what lasting implications will stem from the fear and distance implanted by the COVID-19 crisis. One thing is certain: If things continue this way, we’re going to need a little more than Zoom meetings to hold us together.
Given that I now see more potential for VR’s ability to enhance remote presence, I would like to share some insight on how to get around the headset occlusion illusion in face tracking. During my job hunt last summer, I interviewed at a startup focused on developing in-headset emotion tracking. Because I was not privy to the company’s headset models or camera views, I created a document detailing which facial actions could be detected at various fields of view (FOVs). Though I still believe face-to-face interaction will remain the most important form of communication, new forms of virtual communication will become more valuable in the post-COVID-19 world. I’m sharing this document to help bring expressive VR communication to life sooner.
Tracking What You Cannot See
Throughout my years working in face tracking, I’ve observed a superstition among engineers and researchers that you cannot track facial landmarks (features like the eyebrows, eyes, and mouth) when they fall outside the camera view. This belief is not entirely true. You don’t need to see an eyebrow to know whether it’s raising or furrowing, and you don’t need to see the nose to know when it’s wrinkling.
Our faces bulge, stretch, and wrinkle uniquely with each facial action. I have used these changes to train labelers to recognize and accurately classify discrete expressions from skin movement alone. With well-directed documentation, a comprehensive set of examples, and a high-quality camera, you can extrapolate a great deal of information from a limited FOV.
The minimal FOV required for eye tracking is often enough to track a handful of actions. View A (in the image set below) is most reflective of what a gaze-based tracker view may look like. Though the main goal of eye tracking cameras is to cover just enough of the eye to observe changes in gaze, their potential is much greater. Even with this concentrated view, you can still detect upper lid raiser (AU5), cheek raiser (AU6), and lid tightener (AU7) with a relatively high degree of certainty. These actions are useful for their applications in measuring attention, emotion, and engagement; they are also crucial signals in communication.
Many people get blocked by action unit names like cheek raiser and assume, “We can’t track cheek raiser because our FOV doesn’t cover the cheek area.” But cheek raiser is more than its name reveals; it’s an action caused by the contraction of the orbicularis oculi muscle, which surrounds the eye. While movements of the orbicularis oculi do impact the cheeks, most changes actually take place in the eye socket area. As long as you have a marginal view of the eye corners or a sliver of skin under the lower eyelid, you can determine whether or not cheek raiser is occurring. Similar concepts apply to the other actions I have listed in the images below.
NOTE TO READERS: I left Facebook in late 2019 because the entire AR/VR organization would only offer me short-term employment.
- My salary was 60% of a UX Researcher’s salary.
- I was not given stock.
- I was ineligible for bonuses.
- I was given a forced fixed salary.
If you find the content in this post helpful, please read “Big Tech’s Homogenous Hiring Habits” and educate yourself on the importance of valuing cross-disciplinary knowledge in emerging technology.
Capabilities With Different FOVs
This chart shows which action units (AUs) can be detected with various fields of view. Keep in mind this is an abridged breakdown of what may or may not be possible with different FOVs (i.e., I did not include predictions for most of the lower face, nor did I include which combination shapes should or shouldn’t be detectable). Conditions will change based on additional factors such as camera angle and how the headset rests on the face. (Is the headset heavy? How do its weight and pressure affect various areas of the face?)
If you are working on in-headset face-tracking, don’t let assumptions limit your potential. The face is complicated and full of clues. All you need to do is find the right clues, and you can accomplish a lot from a little.
AU1 = inner brow raiser
AU2 = outer brow raiser
AU4 = brow lowerer
AU5 = upper lid raiser
AU6 = cheek raiser
AU7 = lid tightener
AU9 = nose wrinkler
AU10 = upper lip raiser
AU12 = lip corner puller
green box = detectable at most levels of intensity, robust to facial structure
yellow box = detectable at moderate to high intensity levels, less robust to facial structure
orange box = contingent on intensity level, fallible to certain facial structures
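If you want to carry this legend into a tracking pipeline, one option is to encode it as a small lookup table. The following is a minimal Python sketch, not part of the original document: the names AU_NAMES, Detectability, CHART, and detectable_aus are hypothetical, and only View A is filled in, using the three actions named earlier in the text. Its color ratings are left unset because the chart's colors live in the image and are not spelled out in the text.

```python
from enum import Enum
from typing import Dict, List, Optional

# AU numbers and names, copied from the legend above.
AU_NAMES: Dict[int, str] = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
    5: "upper lid raiser",
    6: "cheek raiser",
    7: "lid tightener",
    9: "nose wrinkler",
    10: "upper lip raiser",
    12: "lip corner puller",
}


class Detectability(Enum):
    """The three color codes from the legend above."""
    GREEN = "detectable at most levels of intensity, robust to facial structure"
    YELLOW = "detectable at moderate to high intensity levels, less robust to facial structure"
    ORANGE = "contingent on intensity level, fallible to certain facial structures"


# Hypothetical container: view label -> {AU number -> rating}.
# Only View A is listed, with the three AUs the text says it can detect.
# Ratings are None (unknown) here; fill them in from the chart image.
CHART: Dict[str, Dict[int, Optional[Detectability]]] = {
    "A": {5: None, 6: None, 7: None},
}


def detectable_aus(view: str) -> List[str]:
    """Return readable labels for the AUs listed under a given view."""
    return [f"AU{n} ({AU_NAMES[n]})" for n in sorted(CHART.get(view, {}))]


print(detectable_aus("A"))
```

Other views and their color ratings can be added to CHART by reading them off the chart; a view that has not been entered simply returns an empty list.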