Project Description

Project realised at the High Performance Humanoid Technologies Lab.

Action understanding is an important cognitive faculty which can help robots efficiently encode, store, and retrieve observed human demonstrations. This is of great interest in cognitive robotics for creating memory units that encapsulate information gained from past experiences, which can then be recalled to adapt ongoing and future behaviors. We introduce a novel deep neural network architecture for encoding, storing, and recalling past action experiences in an episodic memory-like manner. The network creates a low-dimensional latent space representation of the observed actions. Such a formulation in the latent space allows the robot to classify different action types and to retrieve the episodes most similar to the query action (see model figure). The proposed deep network further helps robots predict and generate the next possible frames of the currently observed action.

Contribution: The contribution of this work is manifold: (1) We implement a new deep network to encode action frames in a low-dimensional latent vector space. (2) This vector representation is used to reconstruct the video frames in an auto-encoder manner. (3) We show that the same latent vectors can also be employed to generate the upcoming video frames. (4) We benchmark the proposed network on two different, well-known action datasets. (5) We propose a mechanism for matching and retrieving visual experiences and finally evaluate its performance on the humanoid robot ARMAR-III.

[arXiv] [PDF]
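
Below is a minimal, hypothetical sketch of the overall idea in PyTorch. It is not the exact architecture from the paper (which uses conv LSTM layers): a convolutional encoder compresses the observed frames, an FC LSTM summarizes them into a single latent vector, and two decoder heads reconstruct the observed frames and generate the upcoming frames from that vector. All layer sizes, class names, and variable names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EpisodicActionNet(nn.Module):
    """Toy encoder / latent vector / two-decoder layout (illustrative only)."""

    def __init__(self, latent_dim=1000, frame_size=64, n_frames=5):
        super().__init__()
        self.frame_size, self.n_frames = frame_size, n_frames
        # Per-frame convolutional encoder (grayscale frames -> feature vectors).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 64 * (frame_size // 4) ** 2
        # FC LSTM that summarizes the frame features into one latent episode code.
        self.lstm = nn.LSTM(feat_dim, latent_dim, batch_first=True)
        # Decoder heads: reconstruct the observed frames and predict future frames.
        out_dim = n_frames * frame_size * frame_size
        self.decode_observed = nn.Linear(latent_dim, out_dim)
        self.decode_future = nn.Linear(latent_dim, out_dim)

    def forward(self, frames):
        # frames: (batch, time, 1, H, W)
        b, t, c, h, w = frames.shape
        feats = self.conv(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (hidden, _) = self.lstm(feats)
        latent = hidden[-1]  # (batch, latent_dim) episode representation
        recon = self.decode_observed(latent).reshape(b, self.n_frames, 1, h, w)
        future = self.decode_future(latent).reshape(b, self.n_frames, 1, h, w)
        return latent, recon, future

model = EpisodicActionNet()
latent, recon, future = model(torch.rand(2, 5, 1, 64, 64))
print(latent.shape, recon.shape, future.shape)
```

In this view, retrieving the most similar past episodes reduces to a nearest-neighbor search over the stored latent vectors (e.g. by cosine similarity), and action classification can be performed by a standard classifier on top of the same vectors.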

Training Overview

All training runs were executed with our conv5/conv LSTM model with an FC LSTM layer of 1000 units, yielding 1000-dimensional latent representations. The ratio of images given to the encoder / decoder (prediction) / decoder (future) for our models is 5/5/5.
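
As a concrete illustration of that 5/5/5 split, the following hypothetical snippet shows how one training clip could be divided into encoder input, reconstruction targets for the prediction decoder, and targets for the future decoder (function and variable names are placeholders, not the original training code):

```python
def split_clip(frames, n_enc=5, n_future=5):
    """frames: consecutive video frames, length >= n_enc + n_future."""
    encoder_input = frames[:n_enc]                   # shown to the encoder
    prediction_target = frames[:n_enc]               # decoder (prediction) reconstructs these
    future_target = frames[n_enc:n_enc + n_future]   # decoder (future) has to generate these
    return encoder_input, prediction_target, future_target

enc_in, pred_tgt, fut_tgt = split_clip(list(range(10)))
print(enc_in, pred_tgt, fut_tgt)
# [0, 1, 2, 3, 4] [0, 1, 2, 3, 4] [5, 6, 7, 8, 9]
```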

Model Configurations Overview

Models used during training – layers from encoder input (left) to output (right). For conv and conv LSTM layers, the filter size k×k is given, whereas fully connected (FC) and FC LSTM layers are characterized by their layer size n.
