Image Credit: Google
The animated masks, glasses, and hats that apps like YouTube Stories overlay on faces are pretty nifty, but how on earth do they look so realistic? Well, thanks to a deep dive published this morning by Google’s AI research division, it’s less of a mystery than before. In the blog post, engineers at the Mountain View company describe the AI tech at the core of Stories and ARCore’s Augmented Faces API, which they say can simulate light reflections, model face occlusions and specular reflection, and more, all in real time with a single camera.
“One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world,” Google AI’s Artsiom Ablavatski and Ivan Grishchenko explain, “a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown, or smirk.”
Google’s augmented reality (AR) pipeline taps TensorFlow Lite, a lightweight implementation of Google’s TensorFlow machine learning framework for mobile and embedded devices, for hardware-accelerated processing where available. It comprises two neural networks (i.e., layers of math functions modeled after biological neurons). The first, a detector, operates on camera data and computes face locations, while the second, a 3D mesh model, uses that location data to predict surface geometry.
Above: The 3D mesh in action. Image Credit: Google
Why the two-model approach? Two reasons, Ablavatski and Grishchenko say. First, it “drastically reduces” the need to augment the dataset with synthetic data. Second, it allows the AI system to dedicate most of its capacity to accurately predicting mesh coordinates. “[Both of these are] critical to achieve proper anchoring of the virtual content,” they add.
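To make the two-stage idea concrete, here is a minimal Python sketch of a detector feeding a mesh model. It is an illustration only, not Google's code: the function names, the pixel-based coordinate convention, and the `min_confidence` cutoff are all assumptions, and `fake_detector` and `fake_mesh` are hypothetical stand-ins for the real networks.

```python
from typing import Callable, List, Tuple

Frame = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels
Point3D = Tuple[float, float, float]


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the face region out of the full frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


def run_pipeline(
    frame: Frame,
    detect_faces: Callable[[Frame], List[Box]],
    predict_mesh: Callable[[Frame], Tuple[List[Point3D], float]],
    min_confidence: float = 0.5,   # assumed cutoff, not from the article
) -> List[List[Point3D]]:
    """Stage 1 finds faces; stage 2 predicts a mesh for each cropped face."""
    meshes = []
    for box in detect_faces(frame):
        points, confidence = predict_mesh(crop(frame, box))
        if confidence < min_confidence:
            continue  # the mesh model does not believe a face is present
        left, top, _, _ = box
        # Shift x/y from crop coordinates back into full-frame coordinates.
        meshes.append([(x + left, y + top, z) for x, y, z in points])
    return meshes


# Hypothetical stand-ins so the sketch runs without any real models.
def fake_detector(frame: Frame) -> List[Box]:
    return [(1, 1, 3, 3)]


def fake_mesh(face_crop: Frame) -> Tuple[List[Point3D], float]:
    height, width = len(face_crop), len(face_crop[0])
    return [(width / 2, height / 2, 0.0)], 0.9


blank_frame = [[0.0] * 4 for _ in range(4)]
print(run_pipeline(blank_frame, fake_detector, fake_mesh))
```

Because the mesh stage only ever sees a cropped face, it never has to learn where faces are in the frame, which is the division of labor the researchers describe.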
The next step entails applying the mesh network to a single frame of camera footage at a time, using a smoothing technique that minimizes lag and noise. The mesh network takes cropped video frames as input and, having been trained on labeled real-world data, outputs both 3D point positions and the probability that a face is present and “reasonably aligned” in the frame.
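The blog post does not spell out which smoothing technique is used, so the sketch below swaps in a plain exponential moving average to show the general trade-off between lag and noise. The class name `LandmarkSmoother` and the `alpha` parameter are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, ...]


class LandmarkSmoother:
    """Exponential moving average over per-frame mesh points.

    A stand-in technique, not the filter Google describes. Higher alpha
    follows new frames more closely (less lag, more jitter); lower alpha
    smooths harder (less jitter, more lag).
    """

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._state: Optional[List[Point3D]] = None

    def update(self, points: Sequence[Sequence[float]]) -> List[Point3D]:
        """Blend the newest frame's points into the running estimate."""
        if self._state is None or len(self._state) != len(points):
            # First frame, or the point count changed: start over.
            self._state = [tuple(p) for p in points]
        else:
            a = self.alpha
            self._state = [
                tuple(a * new + (1.0 - a) * old for new, old in zip(p, s))
                for p, s in zip(points, self._state)
            ]
        return self._state


smoother = LandmarkSmoother(alpha=0.5)
smoother.update([(0.0, 0.0, 0.0)])
print(smoother.update([(2.0, 4.0, 6.0)]))
```

With `alpha=0.5`, each smoothed point lands halfway between the previous estimate and the newest prediction.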
Recent performance and accuracy improvements to the AR pipeline come courtesy of the latest TensorFlow Lite, which Ablavatski and Grishchenko say boosts performance while “significantly” lowering power consumption. They’re also the result of a workflow that iteratively bootstraps and refines the mesh model’s predictions, making it easier for the team to tackle challenging cases (such as grimaces and oblique angles) and artifacts (like camera imperfections and extreme lighting conditions).
Above: Model performance compared. Image Credit: Google
Interestingly, the pipeline doesn’t rely on a single fixed set of models. Instead, it comprises a “variety” of architectures designed to support a range of devices. “Lighter” networks, which require less memory and processing power, necessarily use lower-resolution input data (128 x 128), while the most mathematically complex models bump up the resolution to 256 x 256.
According to Ablavatski and Grishchenko, the fastest “full mesh” model achieves an inference time of less than 10 milliseconds on the Google Pixel 3 (using the graphics chip), while the lightest cuts that down to 3 milliseconds per frame. They’re a bit slower on Apple’s iPhone X, but only by a hair: the lightest model performs inference in about 4 milliseconds (using the GPU), while the full mesh takes 14 milliseconds.
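For a sense of scale, the short Python sketch below converts some of those inference times into a share of the per-frame time budget. The 30 frames-per-second target is an assumption chosen for illustration and is not a figure from the blog post. The Pixel 3 full-mesh entry uses the reported upper bound of 10 milliseconds.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame at a given frame rate."""
    return 1000.0 / fps


def budget_share(inference_ms: float, fps: float = 30.0) -> float:
    """Fraction of one frame's time budget spent on model inference."""
    return inference_ms / frame_budget_ms(fps)


# Inference times quoted in the article, in milliseconds.
latencies_ms = {
    "Pixel 3, full mesh (upper bound)": 10.0,
    "Pixel 3, lightest model": 3.0,
    "iPhone X, full mesh": 14.0,
}

for name, ms in latencies_ms.items():
    # 30 fps is an assumed target, so the budget is about 33.3 ms per frame.
    print(f"{name}: {budget_share(ms):.0%} of a 30 fps frame budget")
```

Even the slowest figure here leaves more than half of each frame's time for rendering and everything else the app has to do, under that 30 fps assumption.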