The Google AI researchers behind “Learning the Depths of Moving People by Watching Frozen People” received a Best Paper Honorable Mention last week at the Conference on Computer Vision and Pattern Recognition (CVPR) 2019, held in California.
The paper used data sets of uploaded Mannequin Challenge videos to teach robots to perceive 3D spaces within 2D images. The team was composed of seven researchers: Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Ce Liu, Bill Freeman, and Noah Snavely.
Robots, unlike humans, cannot perceive 3D space when presented with a 2D picture or video. This creates navigation problems for machines that rely on machine learning. The most notable example is a self-driving car that fails to recognize sudden changes on the road, such as crossing pedestrians and bikers.
Enter the Mannequin Challenge: social media videos in which people strike a pose and hold it, seemingly “frozen,” while a cameraman pans through the scene. It turns out that those videos are well suited to training AI to perceive 3D space from 2D video.
The team combed through thousands of Mannequin Challenge videos on YouTube and picked 2,000 for the data set. The videos were then checked to weed out any that would yield invalid data, such as videos in which someone “unfroze” or videos shot with specialized lenses or filters.
After selecting the valid videos, the team used them to train a neural network to predict the depth of a moving object. Based on a series of tests, they concluded that their method yielded more accurate depth predictions than previous state-of-the-art methods.
Google AI Research’s Methods
Computer vision is an interdisciplinary field that aims to have computers understand digital images and videos. One of its subfields is scene reconstruction, in which a computer reconstructs a scene’s geometry from 2D image data.
According to Dekel and Cole, the team approached the computer vision problem with deep learning. The more videos fed into the neural network, the better it learns to perceive depth.
To build their model, the Google AI team used Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques to compute depth. The resulting depths were recorded and used as “ground truth” for training the neural network.
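Depth recovered from a single moving camera with SfM is generally known only up to an unknown global scale, so a prediction is often compared with such ground truth in a way that ignores scale. The sketch below shows one common choice, a scale-invariant error on log depths. It is an illustration under that assumption, not the loss the team used; the function name and sample values are hypothetical.

```python
import math


def scale_invariant_log_error(predicted, ground_truth):
    """Spread of log-depth differences; zero when depths match up to one scale factor."""
    # Hypothetical helper: inputs are flat lists of positive depth values.
    diffs = [math.log(p) - math.log(g) for p, g in zip(predicted, ground_truth)]
    if not diffs:
        raise ValueError("need at least one depth pair")
    mean = sum(diffs) / len(diffs)
    # Subtracting the mean removes any constant scale offset in log space.
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(variance)


ground_truth = [1.0, 2.0, 4.0]
doubled = [2.0, 4.0, 8.0]    # same shape, twice the scale
distorted = [1.0, 4.0, 4.0]  # relative depths are wrong

same_shape_error = scale_invariant_log_error(doubled, ground_truth)
distorted_error = scale_invariant_log_error(distorted, ground_truth)
print(same_shape_error, distorted_error)
```

A prediction that is uniformly twice the ground truth scores essentially zero, while one that gets the relative depths wrong does not.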
MVS estimates depth for the non-moving elements in a frame, for example people who are “frozen.” In real footage, however, the people in a frame move as well. The SfM technique helps improve the accuracy of the predicted depth. Before running SfM, the team computes the optical flow of the video, which isolates images.
Aside from SfM, the team also computed motion parallax and the 2D optical flow between an input frame and another frame in the video, which the computer uses to estimate depth.
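The link between parallax and depth can be illustrated with the textbook two-view relation. The sketch below assumes an idealized case that the article does not describe: a camera that translates purely sideways by a known baseline between two frames, a static scene point, and a known focal length in pixels. Under those assumptions, depth is inversely proportional to the horizontal flow of a pixel. The function name and the numbers are illustrative and are not taken from the paper.

```python
def depth_from_parallax(focal_px, baseline_m, flow_px):
    """Depth in meters of a static point from its horizontal flow between two views.

    Assumes a purely sideways camera translation (hypothetical, idealized setup).
    """
    if flow_px <= 0:
        raise ValueError("flow must be positive for a point at finite depth")
    return focal_px * baseline_m / flow_px


# Nearby points shift more between frames than distant ones.
near = depth_from_parallax(focal_px=1000.0, baseline_m=0.1, flow_px=50.0)
far = depth_from_parallax(focal_px=1000.0, baseline_m=0.1, flow_px=5.0)
print(near, far)
```

This relation only holds for points that do not move between the two frames, which is one way to see why footage of “frozen” people is useful: the whole scene, people included, behaves like a static object.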
After training the neural network on videos of stationary people, the team then chose real-world videos of complex human actions captured by a moving hand-held camera. Drawing on what it had learned from the earlier videos, the network adapted and began predicting the depth of people even when they were moving freely.
The team compared their model’s predicted depth maps with those of other depth-prediction models, including Deep Ordinal Regression Network for Monocular Depth Estimation (DORN), Chen et al.’s Single-Image Depth Perception in the Wild, and Depth and Motion Network for Learning Monocular Stereo (DeMoN). They found that their model was the most accurate.
The Google AI team’s research helps advance the study of robotics.
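The article does not say which accuracy measure was used in this comparison. One measure often reported for depth prediction is threshold accuracy: the fraction of pixels whose predicted depth is within a fixed ratio of the ground truth. The sketch below is a minimal illustration of that idea, with a hypothetical function name, the conventional 1.25 threshold as an assumption, and made-up sample values.

```python
def threshold_accuracy(predicted, ground_truth, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) is below the threshold."""
    # Skip pixels without a valid (positive) depth on either side.
    pairs = [(p, g) for p, g in zip(predicted, ground_truth) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to compare")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


predicted = [1.0, 2.0, 4.0, 10.0]
ground_truth = [1.1, 2.0, 8.0, 9.0]
accuracy = threshold_accuracy(predicted, ground_truth)
print(accuracy)
```

Here three of the four sample pixels fall within the ratio, so a higher score for one model than another would mean more of its depth map is close to the reference.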
The Mannequin Challenge
The uploaded videos of the Mannequin Challenge were critical elements of the team’s research.
Following viral internet memes like Planking and the Ice Bucket Challenge, the Mannequin Challenge trended in 2016.
The goal of the challenge is to imitate mannequins in various poses while music plays in the background. A single cameraman records the video as the challengers hold their poses for as long as possible.
Challengers staged creative scenes, often in limited spaces, sometimes with a general theme running through the poses. The challenge originated in high schools in the US. The first video was uploaded to Twitter on October 26, 2016.
Other videos show people doing the challenge at parties, in bars, and in wide-open spaces like theaters or parks. Some videos feature varied, complex poses that were challenging for the machine learning model.