Alexander Richard
PhD Student, University of Bonn
Computer Vision Group of Prof. Juergen Gall
richard [at] iai.uni-bonn.de
Publications
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
Alexander Richard, Hilde Kuehne, Ahsan Iqbal, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Video learning is an important task in computer vision and has experienced increasing interest in recent years. Since even a small number of videos easily comprises several million frames, methods that do not rely on a frame-level annotation are of special importance. In this work, we propose a novel learning algorithm with a Viterbi-based loss that allows for online and incremental learning of weakly annotated video data. We moreover show that explicit context and length modeling leads to huge improvements in video segmentation and labeling tasks and integrate these models into our framework. On several action segmentation benchmarks, we obtain an improvement of up to 10% compared to current state-of-the-art methods.
Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints
Alexander Richard, Hilde Kuehne, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this area assume an ordered sequence of occurring actions to be given, our approach only uses a set of actions. Such action sets provide much less supervision since neither action ordering nor the number of action occurrences are known. In exchange, they can be easily obtained, for instance, from meta-tags, while ordered sequences still require human annotation. We introduce a system that automatically learns to temporally segment and label actions in a video, where the only supervision that is used are action sets. An evaluation on three datasets shows that our method still achieves good results although the amount of supervision is significantly smaller than for other related methods.
When will you do what? - Anticipating Temporal Occurrences of Activities
Yazan Abu Farha, Alexander Richard, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Analyzing human actions in videos has gained increased attention recently. While most works focus on classifying and labeling observed video frames or anticipating the very recent future, making long-term predictions over more than just a few seconds is a task with many practical applications that has not yet been addressed. In this paper, we propose two methods to predict a considerably large number of future actions and their durations. Both a CNN and an RNN are trained to learn future video labels based on previously seen content. We show that our methods generate accurate predictions of the future even for long videos with a large number of different actions and can even deal with noisy or erroneous input information.
Recurrent Residual Learning for Action Recognition
Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall
German Conference on Pattern Recognition (GCPR), 2017 (Best Master's Award)
Action recognition is a fundamental problem in computer vision with many potential applications such as video surveillance, human-computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize actions happening within videos. Historically, hand-crafted video features were used to address the task of action recognition. With the success of Deep ConvNets as an image analysis method, many extensions of standard ConvNets were proposed to process variable-length video data. In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition. The approach extends ResNet, a state-of-the-art model for image classification. While the original formulation of ResNet aims at learning spatial residuals in its layers, we extend the approach by introducing recurrent connections that allow learning a spatio-temporal residual. In contrast to fully recurrent networks, our temporal connections only allow a limited range of preceding frames to contribute to the output for the current frame, enabling efficient training and inference as well as limiting the temporal context to a reasonable local range around each frame. On a large-scale action recognition dataset, we show that our model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling
Alexander Richard, Hilde Kuehne, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 (Oral)
We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to allow for a temporal alignment and inference over long sequences. While this system alone already generates good results, we show that the performance can be further improved by approximating the number of subactions to the characteristics of the different action classes. To this end, we adapt the number of subaction classes by iterating realignment and reestimation during training. The proposed system is evaluated on two benchmark datasets, the Breakfast and the Hollywood extended dataset, showing a competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.
Weakly Supervised Learning of Actions from Transcripts
Hilde Kuehne, Alexander Richard, Juergen Gall
Computer Vision and Image Understanding (CVIU), 2017
We present an approach for weakly supervised learning of human actions from video transcriptions. Our system is based on the idea that, given a sequence of input data and a transcript, i.e. a list of the order the actions occur in the video, it is possible to infer the actions within the video stream, and thus, learn the related action models without the need for any frame-based annotation. Starting from the transcript information at hand, we split the given data sequences uniformly based on the number of expected actions. We then learn action models for each class by maximizing the probability that the training video sequences are generated by the action models given the sequence order as defined by the transcripts. The learned model can be used to temporally segment an unseen video with or without transcript. We evaluate our approach on four distinct activity datasets, namely Hollywood Extended, MPII Cooking, Breakfast and CRIM13. We show that our system is able to align the scripted actions with the video data and that the learned models localize and classify actions competitively in comparison to models trained with full supervision, i.e. with frame level annotations, and that they outperform any current state-of-the-art approach for aligning transcripts with video data.
A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition
Alexander Richard, Juergen Gall
Computer Vision and Image Understanding (CVIU), 2017
The traditional bag-of-words approach has found a wide range of applications in computer vision. The standard pipeline consists of a generation of a visual vocabulary, a quantization of the features into histograms of visual words, and a classification step for which usually a support vector machine in combination with a non-linear kernel is used. Given large amounts of data, however, the model suffers from a lack of discriminative power. This applies particularly to action recognition, where the vast amount of video features needs to be subsampled for unsupervised visual vocabulary generation. Moreover, the kernel computation can be very expensive on large datasets. In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables the application of discriminative training. The model further allows the kernel computation to be incorporated into the neural network directly, solving the complexity issue and allowing the complete classification system to be represented within a single network. We evaluate our method on four recent action recognition benchmarks and show that the conventional model as well as sparse coding methods are outperformed.
Temporal Action Detection using a Statistical Language Model
Alexander Richard, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
While current approaches to action recognition on pre-segmented video clips already achieve high accuracies, temporal action detection is still far from comparably good results. Automatically locating and classifying the relevant action segments in videos of varying lengths proves to be a challenging task. We propose a novel method for temporal action detection including statistical length and language modeling to represent temporal and contextual structure. Our approach aims at globally optimizing the joint probability of three components, a length and language model and a discriminative action model, without making intermediate decisions. The problem of finding the most likely action sequence and the corresponding segment boundaries in an exponentially large search space is addressed by dynamic programming. We provide an extensive evaluation of each model component on Thumos 14, a large action detection dataset, and report state-of-the-art results on three datasets.
A BoW-equivalent Recurrent Neural Network for Action Recognition
Alexander Richard, Juergen Gall
British Machine Vision Conference (BMVC), 2015
Bag-of-words (BoW) models are widely used in the field of computer vision. A BoW model consists of a visual vocabulary that is generated by unsupervised clustering of the features of the training data, e.g., by using kMeans. The clustering methods, however, struggle with large amounts of data, in particular in the context of action recognition. In this paper, we propose a transformation of the standard BoW model into a neural network, enabling discriminative training of the visual vocabulary on large action recognition datasets. We show that our model is equivalent to the original BoW model but allows for the application of supervised neural network training. Our model outperforms the conventional BoW model and sparse coding methods on recent action recognition benchmarks.
Mean-normalized Stochastic Gradient for Large-scale Deep Learning
Simon Wiesler, Alexander Richard, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy.
RASR/NN: The RWTH Neural Network Toolkit for Speech Recognition
Simon Wiesler, Alexander Richard, Pavel Golik, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper describes the new release of RASR - the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuration, and features of the NN module. The key feature is a high flexibility regarding the network topology, choice of activation functions, training criteria, and optimization algorithm, as well as built-in support for efficient GPU computing. The evaluation of run-time performance and recognition accuracy is performed, by way of example, with a deep neural network as the acoustic model in a hybrid NN/HMM system. The results show that RASR achieves state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems.
A Critical Evaluation of Stochastic Algorithms for Convex Optimization
Simon Wiesler, Alexander Richard, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
Log-linear models find a wide range of applications in pattern recognition. The training of log-linear models is a convex optimization problem. In this work, we compare the performance of stochastic and batch optimization algorithms. Stochastic algorithms are fast on large data sets but cannot be parallelized well. In our experiments on a broadcast conversations recognition task, stochastic methods yield competitive results after only a short training period, but when spending enough computational resources for parallelization, batch algorithms are competitive with stochastic algorithms. We obtained slight improvements by using a stochastic second order algorithm. Our best log-linear model outperforms the maximum likelihood trained Gaussian mixture model baseline despite being ten times smaller.
Feature Selection for Log-linear Acoustic Models
Simon Wiesler, Alexander Richard, Yotaro Kubo, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Log-linear acoustic models have been shown to be competitive with Gaussian mixture models in speech recognition. Their high training time can be reduced by feature selection. We compare a simple univariate feature selection algorithm with ReliefF - an efficient multivariate algorithm. An alternative to feature selection is l1-regularized training, which leads to sparse models. We observe that this gives no speedup when sparse features are used, hence feature selection methods are preferable. For dense features, l1-regularization can reduce training and recognition time. We generalize the well known Rprop algorithm for the optimization of l1-regularized functions. Experiments on the Wall Street Journal corpus showed that a large number of sparse features could be discarded without loss of performance. A strong regularization led to slight performance degradations, but can be useful on large tasks, where training the full model is not tractable.
Projects
Squirrel
A large action recognition framework featuring CNNs, RNNs, and efficient Viterbi decodings supporting left-regular grammars and length modeling.
Queuing-Tool
A priority scheduler that distributes the resources of a machine among several jobs of different users. Supports multiple GPUs and multi-threaded programs. Using simple extensions to a bash script, users can define the resources to allocate, how many job instances to run in parallel, and which jobs have to wait for other jobs to finish before starting.
Education/Work

Mar 2018 - Present

Research Intern at Facebook Reality Labs

Six-month research internship at Facebook Reality Labs (formerly Oculus Research) in Pittsburgh.

Mar 2014 - Present

PhD Student, University of Bonn

Researcher in the field of video analytics and action recognition.
Research focus on temporal action segmentation and action labeling for long, untrimmed videos containing multiple action instances.

Oct 2011 - Feb 2014

Master's Degree, RWTH Aachen University

Thesis: Improved Optimization of Neural Networks
Implementation of a neural network module for the speech recognition software RASR; design, analysis, and evaluation of a novel optimization algorithm called mean-normalized stochastic gradient descent (MN-SGD).

Oct 2008 - Sep 2011

Bachelor's Degree, RWTH Aachen University

Thesis: Optimization of Log-linear Models and Online Learning
Analytical and empirical evaluation of various optimization methods for log-linear models.