Gesture Recognition from Stereo Vision

Scope
Research supervised by Jan Cech & Radu Horaud at INRIA, Perception Lab.

Duration
June 2014 — July 2014 (2 months)

Applied Skills
Machine Learning, Computer Vision, Python, OpenCV, NumPy & SciPy

Summary

We developed a gesture recognition system to enable nonverbal human-to-robot communication. The system requires only two cameras for input and operates at the frame level for responsiveness. Using binocular vision, we estimate the spatial configuration of the person's body and predict which gesture is being made in the current frame. The system correctly recognized the gesture in more than 92% of the frames of a test data set.

Nao, the humanoid robot we developed for.

Method

We acquire the left and right images frame-by-frame from the two cameras of the robot.

We compute the raw depth map from the left/right disparity.
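
As a rough illustration of this step, here is a minimal OpenCV sketch using semi-global block matching; the matcher parameters and the focal length/baseline are placeholder values, not the calibration actually used on Nao:

    import cv2
    import numpy as np

    # Placeholder calibration values; the real ones come from the robot's
    # stereo calibration, not from this sketch.
    FOCAL_LENGTH_PX = 700.0  # assumed focal length in pixels
    BASELINE_M = 0.07        # assumed distance between the cameras (meters)

    def depth_from_stereo(left_bgr, right_bgr):
        """Compute a raw depth map (meters) from a rectified stereo pair."""
        left = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
        right = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)

        # Semi-global block matching; parameters are illustrative only.
        matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64,
                                        blockSize=9)
        # StereoSGBM returns fixed-point disparities scaled by 16.
        disparity = matcher.compute(left, right).astype(np.float32) / 16.0

        # depth = f * B / disparity; non-positive disparities are invalid.
        depth = np.zeros_like(disparity)
        valid = disparity > 0
        depth[valid] = FOCAL_LENGTH_PX * BASELINE_M / disparity[valid]
        return depth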

We locate the main person interacting with the robot using a face recognition algorithm and we discard the 3D points that do not belong to that person.
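
A minimal sketch of this segmentation step, assuming OpenCV's stock Haar-cascade face detector and a hypothetical depth margin around the detected face (the report describes the detector actually used):

    import cv2
    import numpy as np

    # Stock OpenCV frontal-face Haar cascade (an assumption for this sketch).
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def segment_person(left_gray, depth, margin_m=0.6):
        """Zero out depth pixels farther than margin_m from the detected face."""
        faces = face_cascade.detectMultiScale(left_gray, scaleFactor=1.1,
                                              minNeighbors=5)
        if len(faces) == 0:
            return None
        # Take the largest detection as the main person facing the robot.
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
        face_depth = np.median(depth[y:y + h, x:x + w])
        mask = np.abs(depth - face_depth) < margin_m
        return np.where(mask, depth, 0.0)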

We group the 3D points into regions that represent different parts of the body, allowing us to determine the person's body configuration. The regions are determined in an unsupervised way by applying the bag-of-words model.
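
In a bag-of-words model, the unsupervised "words" are cluster centers learned from the 3D points, and each frame is described by a histogram of word assignments. A minimal SciPy sketch, with the codebook size k chosen arbitrarily here:

    import numpy as np
    from scipy.cluster.vq import kmeans2, vq

    def build_codebook(training_points, k=20, seed=0):
        """Learn k cluster centers ('words') from 3D points, unsupervised."""
        centers, _ = kmeans2(training_points, k, minit="++", seed=seed)
        return centers

    def bow_descriptor(points_3d, codebook):
        """Normalized histogram of nearest-word assignments for one frame."""
        labels, _ = vq(points_3d, codebook)
        hist = np.bincount(labels, minlength=len(codebook)).astype(float)
        return hist / max(hist.sum(), 1.0)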

We predict what gesture the person is making by running a random forest classifier that we trained on a representative dataset.
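
Frame-level classification could then look like this scikit-learn sketch (scikit-learn is an assumed stand-in here, and the synthetic arrays stand in for the real per-frame descriptors and gesture labels):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-ins: one 20-bin bag-of-words descriptor per frame,
    # one gesture label per frame. Real data comes from the steps above.
    rng = np.random.default_rng(0)
    X_train = rng.random((500, 20))
    y_train = rng.integers(0, 3, size=500)  # e.g. three gestures

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # At run time, each new frame's descriptor is classified independently.
    gesture = clf.predict(rng.random((1, 20)))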

Results

Recognition Accuracy
We obtained 92% recognition accuracy, tuning the number of trees in the random forest to avoid under- and over-fitting.
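
One way to pick the forest size is a cross-validated sweep over the number of trees; this is a sketch of the idea with synthetic stand-in data, not the project's actual tuning procedure (see the report for that):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 20))          # per-frame descriptors (stand-in)
    y = rng.integers(0, 3, size=500)   # gesture labels (stand-in)

    # Too few trees under-fits; past some point extra trees stop helping.
    for n_trees in (5, 10, 25, 50, 100, 200):
        scores = cross_val_score(
            RandomForestClassifier(n_estimators=n_trees, random_state=0),
            X, y, cv=5)
        print(n_trees, round(scores.mean(), 3))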


Quality of the vector representation
The accompanying figure plots each component of the vector over time as a line graph. Each time plot corresponds to a different person performing the same three gestures. The vector representation exhibits interesting patterns that are consistent across different people and meaningful with regard to the gestures.
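
Such a plot can be produced directly from the per-frame descriptors; a minimal matplotlib sketch, with synthetic data as a stand-in:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in: 200 frames, each with a 20-dimensional descriptor.
    descriptors = np.random.default_rng(0).random((200, 20))

    plt.plot(descriptors)  # one line per descriptor component
    plt.xlabel("frame")
    plt.ylabel("component value")
    plt.show()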

Take a look at my Report (PDF) if you want to find out more about this research.