Alexander Richard
Research Scientist at Meta Reality Labs Research, Pittsburgh
richardalex [_at_] fb.com Google Scholar Twitter
Publications
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.
@inproceedings{ng2024audio2photoreal, title={From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations}, author={Ng, Evonne and Romero, Javier and Bagautdinov, Timur and Bai, Shaojie and Darrell, Trevor and Kanazawa, Angjoo and Richard, Alexander}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, year={2024} }
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot learning approach. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. We will make our dataset publicly available.
@inproceedings{chen2024raf, title={Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark}, author={Chen, Ziyang and Gebru, Israel D. and Richardt, Christian and Kumar, Anurag and Laney, William and Owens, Andrew and Richard, Alexander}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, year={2024} }
ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter
Yi-Chiao Wu, Dejan Markovic, Steven Krenn, Israel D. Gebru, Alexander Richard
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
Although recent mainstream waveform-domain end-to-end (E2E) neural audio codecs achieve impressive coded audio quality with a very low bitrate, the quality gap between the coded and natural audio is still significant. A generative adversarial network (GAN) training is usually required for these E2E neural codecs because of the difficulty of direct phase modeling. However, such adversarial learning hinders these codecs from preserving the original phase information. To achieve human-level naturalness with a reasonable bitrate, preserve the original phase, and get rid of the tricky and opaque GAN training, we develop a score-based diffusion post-filter (SPF) in the complex spectral domain and combine our previous AudioDec with the SPF to propose ScoreDec, which can be trained using only spectral and score-matching losses. Both the objective and subjective experimental results show that ScoreDec with a 24kbps bitrate encodes and decodes full-band 48kHz speech with human-level naturalness and well-preserved phase information.
@inproceedings{wu2024scoredec, title={ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter}, author={Wu, Yi-Chiao and Markovic, Dejan and Krenn, Steven and Gebru, Israel D. and Richard, Alexander}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2024} }
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Xudong Xu, Dejan Markovic, Jacob Sandakly, Todd Keebler, Steven Krenn, Alexander Richard
Conference on Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
While 3D human body modeling has received much attention in computer vision, modeling the acoustic equivalent, i.e. modeling 3D spatial audio produced by body motion and speech, has fallen short in the community. To close this gap, we present a model that can generate accurate 3D spatial audio for full human bodies. The system consumes, as input, audio signals from headset microphones and body pose, and produces, as output, a 3D sound field surrounding the transmitter's body, from which spatial audio can be rendered at any arbitrary position in the 3D space. We collect a first-of-its-kind multimodal dataset of human bodies, recorded with multiple cameras and a spherical array of 345 microphones. In an empirical evaluation, we demonstrate that our model can produce accurate body-induced sound fields when trained with a suitable loss.
@inproceedings{xu2023soundingbodies, title={Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio}, author={Xu, Xudong and Markovic, Dejan and Sandakly, Jacob and Keebler, Todd and Krenn, Steven and Richard, Alexander}, booktitle={Conference on Neural Information Processing Systems}, year={2023} }
AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec
Yi-Chiao Wu, Israel D. Gebru, Dejan Markovic, Alexander Richard
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e., the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e., encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU) / 10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. Both objective and subjective evaluations using the VCTK corpus are provided. To sum up, AudioDec is a well-developed plug-and-play benchmark for audio codec applications.
@inproceedings{wu2023audiodec, title={AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec}, author={Wu, Yi-Chiao and Gebru, Israel D. and Markovic, Dejan and Richard, Alexander}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2023} }
NORD: NOn-matching Reference based Relative Depth estimation From Binaural Speech
Pranay Manocha, Israel D. Gebru, Anurag Kumar, Dejan Markovic, Alexander Richard
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
We propose NORD: a novel framework for estimating the relative depth between two binaural speech recordings. In contrast to existing depth estimation techniques, ours only requires audio signals as input. We trained the framework to solve depth preference (i.e. which input perceptually sounds closer to the listener’s head), and quantification tasks (i.e. quantifying the depth difference between the inputs). In addition, training leverages recent advances in metric and multi-task learning, which allows the framework to be invariant to both signal content (i.e. non-matched reference) and directional cues (i.e. azimuth and elevation). Our framework has additional useful qualities that make it suitable for use as an objective metric to benchmark binaural audio systems, particularly depth perception and sound externalization, which we demonstrate through experiments. We also show that NORD generalizes well under different reverberation and environments. The results from preference and quantification tasks correlate well with measured results.
@inproceedings{manocha2023nord, title={Nord: Non-Matching Reference Based Relative Depth Estimation from Binaural Speech}, author={Manocha, Pranay and Gebru, Israel D. and Kumar, Anurag and Markovic, Dejan and Richard, Alexander}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2023} }
Novel-View Acoustic Synthesis
Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.
@inproceedings{chen2023novelview, title={Novel-View Acoustic Synthesis}, author = {Chen, Changan and Richard, Alexander and Shapovalov, Roman and Ithapu, Vamsi Krishna and Neverova, Natalia and Grauman, Kristen and Vedaldi, Andrea}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2023}, }
Multiface: A Dataset for Neural Face Rendering
Cheng-hsin Wuu, Ningyuan Zheng, ..., Alexander Richard, ..., Yaser Sheikh (30 authors)
Technical Report, arXiv, 2022
Photorealistic avatars of human faces have come a long way in recent years, yet research in this area is limited by a lack of publicly available, high-quality datasets covering both dense multi-view camera captures and rich facial expressions of the captured subjects. In this work, we present Multiface, a new multi-view, high-resolution human face dataset collected from 13 identities at Reality Labs Research for neural face rendering. We introduce Mugsy, a large-scale multi-camera apparatus to capture high-resolution synchronized videos of a facial performance. The goal of Multiface is to close the gap in accessibility to high-quality data in the academic community and to enable research in VR telepresence. Along with the release of the dataset, we conduct ablation studies on the influence of different model architectures on the model's capacity to interpolate across novel viewpoints and expressions. With a conditional VAE model serving as our baseline, we found that adding spatial bias, texture warp field, and residual connections improves performance on novel view synthesis. Our code and data are available at https://github.com/facebookresearch/multiface.
@inproceedings{wuu2022multiface, title={Multiface: A Dataset for Neural Face Rendering}, author = {Wuu, Cheng-hsin and Zheng, Ningyuan and Ardisson, Scott and Bali, Rohan and Belko, Danielle and Brockmeyer, Eric and Evans, Lucas and Godisart, Timothy and Ha, Hyowon and Hypes, Alexander and Koska, Taylor and Krenn, Steven and Lombardi, Stephen and Luo, Xiaomin and McPhail, Kevyn and Millerschoen, Laura and Perdoch, Michal and Pitts, Mark and Richard, Alexander and Saragih, Jason and Saragih, Junko and Shiratori, Takaaki and Simon, Tomas and Stewart, Matt and Trimble, Autumn and Weng, Xinshuo and Whitewolf, David and Wu, Chenglei and Yu, Shoou-I and Sheikh, Yaser}, booktitle={arXiv}, year={2022}, doi = {10.48550/ARXIV.2207.11243}, url = {https://arxiv.org/abs/2207.11243} }
LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space
Emre Aksan, Shugao Ma, Akin Caliskan, Stanislav Pidhorskyi, Alexander Richard, Shih-En Wei, Jason Saragih, Otmar Hilliges
European Conference on Computer Vision (ECCV), 2022
Neural face avatars that are trained from multi-view data captured in camera domes can produce photo-realistic 3D reconstructions. However, at inference time, they must be driven by limited inputs such as partial views recorded by headset-mounted cameras or a front-facing camera, and sparse facial landmarks. To mitigate this asymmetry, we introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space. Our proposed model, LiP-Flow, consists of two encoders that learn representations from the rich training-time and impoverished inference-time observations. A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective. We trained our model end-to-end to maximize the similarity of both representation spaces and the reconstruction quality, making the 3D face model aware of the limited driving signals. We conduct extensive evaluations where the latent codes are optimized to reconstruct 3D avatars from partial or sparse observations. We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
@inproceedings{aksan2022lipflow, title={LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space}, author={Aksan, Emre and Ma, Shugao and Caliskan, Akin and Pidhorskyi, Stanislav and Richard, Alexander and Wei, Shih-En and Saragih, Jason and Hilliges, Otmar}, booktitle={European Conference on Computer Vision}, year={2022} }
End-to-End Binaural Speech Synthesis
Wen-Chin Huang, Dejan Markovic, Israel D. Gebru, Anjali Menon, Alexander Richard
Interspeech, 2022
In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
@inproceedings{huang2022endtoend, title={End-to-End Binaural Speech Synthesis}, author={Huang, Wen-Chin and Markovic, Dejan and Gebru, Israel D and Menon, Anjali and Richard, Alexander}, booktitle={Interspeech}, year={2022} }
Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain
Dejan Markovic, Alexandre Defossez, Alexander Richard
Interspeech, 2022
We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene. We divide the scene into two spatial regions containing, respectively, the target and the interfering sound sources. The model is trained end-to-end and performs spatial processing implicitly, without any components based on traditional processing or use of hand-crafted spatial features. We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
@inproceedings{markovic2022multichannel, title={Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain}, author={Markovic, Dejan and Defossez, Alexandre and Richard, Alexander}, booktitle={Interspeech}, year={2022} }
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Karren Yang, Dejan Markovic, Steven Krenn, Vasu Agrawal, Alexander Richard
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts. Yet, state-of-the-art approaches still struggle to generate clean, realistic speech without noise artifacts and unnatural distortions in challenging acoustic environments. In this paper, we propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals. Given the importance of speaker-specific cues in speech, we focus on developing personalized models that work well for individual speakers. We demonstrate the efficacy of our approach on a new audio-visual speech dataset collected in an unconstrained, large vocabulary setting, as well as existing audio-visual datasets, outperforming speech enhancement baselines on both quantitative metrics and human evaluation studies. Please see the supplemental video for qualitative results.
@inproceedings{yang2022audiovisual, title={Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis}, author={Yang, Karren and Markovic, Dejan and Krenn, Steven and Agrawal, Vasu and Richard, Alexander}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2022} }
Conditional Diffusion Probabilistic Model for Speech Enhancement
Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs. While generative models have shown strong potential in speech synthesis, they are still lagging behind in speech enhancement. This work leverages recent advances in diffusion probabilistic models, and proposes a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes. More specifically, we propose a generalized formulation of the diffusion probabilistic model named conditional diffusion probabilistic model that, in its reverse process, can adapt to non-Gaussian real noises in the estimated speech signal. In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models, and investigate the generalization capability of our models to other datasets with noise characteristics unseen during training.
@inproceedings{richard2022cdiff, title={Conditional Diffusion Probabilistic Model for Speech Enhancement}, author={Lu, Yen-Ju and Wang, Zhong-Qiu and Watanabe, Shinji and Richard, Alexander and Yu, Cheng and Tsao, Yu}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2022} }
Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks
Alexander Richard, Peter Dodds, Vamsi Krishna Ithapu
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
Impulse response estimation in high noise and in-the-wild settings, with minimal control of the underlying data distributions, is a challenging problem. We propose a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning. Our framework is driven by a carefully designed neural network that jointly estimates the impulse response and the (a priori unknown) spectral noise characteristics of an observed signal given the source signal. We demonstrate robustness in estimation, even under low signal-to-noise ratios, and show strong results when learning from spatio-temporal real-world speech data. Our framework provides a natural way to interpolate impulse responses on a spatial grid, while also allowing for efficiently compressing and storing them for real-time rendering applications in augmented and virtual reality.
@inproceedings{richard2022deepimpulse, title={Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks}, author={Richard, Alexander and Dodds, Peter and Ithapu, Vamsi Krishna}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2022} }
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
Alexander Richard, Michael Zollhoefer, Yandong Wen, Fernando de la Torre, Yaser Sheikh
IEEE International Conference on Computer Vision (ICCV), 2021
This paper presents a generic method for generating full facial 3D animation from speech. Existing approaches to audio-driven facial animation exhibit uncanny or static upper face animation, fail to produce accurate and plausible co-articulation or rely on person-specific models that limit their scalability. To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. At the core of our approach is a categorical latent space for facial animation that disentangles audio-correlated and audio-uncorrelated information based on a novel cross-modality loss. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion. We demonstrate that our approach outperforms several baselines and obtains state-of-the-art quality both qualitatively and quantitatively. A perceptual user study demonstrates that our approach is deemed more realistic than the current state-of-the-art in over 75% of cases. We recommend watching the supplemental video before reading the paper.
@inproceedings{richard2021meshtalk, title={MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement}, author={Alexander Richard and Michael Zollhoefer and Yandong Wen and Fernando de la Torre and Yaser Sheikh}, booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, year={2021} }
Neural Synthesis of Binaural Speech from Mono Audio
Alexander Richard, Dejan Markovic, Israel D. Gebru, Steven Krenn, Gladstone Butler, Fernando de la Torre, Yaser Sheikh
International Conference on Learning Representations (ICLR), 2021 (Outstanding Paper Award)
We present a neural rendering approach for binaural sound synthesis that can produce realistic and spatially accurate binaural sound in realtime. The network takes, as input, a single-channel audio source and synthesizes, as output, two-channel binaural sound, conditioned on the relative position and orientation of the listener with respect to the source. We investigate deficiencies of the l2-loss on raw waveforms in a theoretical analysis and introduce an improved loss that overcomes these limitations. In an empirical evaluation, we establish that our approach is the first to generate spatially accurate waveform outputs (as measured by real recordings) and outperforms existing approaches by a considerable margin, both quantitatively and in a perceptual study. We will release a first-of-its-kind binaural audio dataset as a benchmark for future research.
@inproceedings{richard2021binaural, title={Neural Synthesis of Binaural Speech from Mono Audio}, author={Richard, Alexander and Markovic, Dejan and Gebru, Israel D and Krenn, Steven and Butler, Gladstone and de la Torre, Fernando and Sheikh, Yaser}, booktitle={International Conference on Learning Representations}, year={2021} }
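The paper's technical core is the observation that a plain l2 loss on raw waveforms under-penalizes phase errors. As a rough, hedged illustration — not the loss proposed in the paper — the PyTorch sketch below combines a waveform l2 term with an amplitude term and an energy-weighted phase term on the complex STFT; the decomposition, weighting, and STFT settings are illustrative assumptions.

```python
# Sketch: waveform l2 plus explicit amplitude and phase terms on the STFT,
# as a stand-in for a phase-aware loss. Decomposition and weights are illustrative.
import torch

def phase_aware_loss(pred, target, n_fft=512, hop=128, eps=1e-8):
    """pred, target: (batch, samples) waveforms, e.g. left/right ear channels."""
    # Plain l2 on the raw waveform (known to under-penalize phase errors).
    l2 = torch.mean((pred - target) ** 2)

    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop, window=window, return_complex=True)
    T = torch.stft(target, n_fft, hop, window=window, return_complex=True)

    # Amplitude term on the magnitude spectrogram.
    amplitude = torch.mean((P.abs() - T.abs()) ** 2)

    # Phase term: penalize the angle between predicted and target complex bins,
    # weighted by target energy so that silent bins do not dominate.
    phase_error = 1.0 - torch.cos(torch.angle(P) - torch.angle(T))
    weight = T.abs() / (T.abs().sum() + eps)
    phase = torch.sum(weight * phase_error)

    return l2 + amplitude + phase

# Toy usage with random signals.
print(phase_aware_loss(torch.randn(2, 48000), torch.randn(2, 48000)))
```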
Implicit HRTF Modeling Using Temporal Convolutional Networks
Israel D. Gebru, Dejan Markovic, Alexander Richard, Steven Krenn, Gladstone Butler, Fernando de la Torre, Yaser Sheikh
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
Estimation of accurate head-related transfer functions (HRTFs) is crucial to achieve realistic binaural acoustic experiences. HRTFs depend on source/listener locations and are therefore expensive and cumbersome to measure; traditional approaches require listener-dependent measurements of HRTFs at thousands of distinct spatial directions in an anechoic chamber. In this work, we present a data-driven approach to learn HRTFs implicitly with a neural network that achieves state-of-the-art results compared to traditional approaches but relies on a much simpler data capture that can be performed in arbitrary, non-anechoic rooms. Despite that simpler and less acoustically ideal data capture, our deep learning based approach learns HRTFs of high quality. We show in a perceptual study that the produced binaural audio is ranked on par with traditional DSP approaches by humans and illustrate that interaural time differences (ITDs), interaural level differences (ILDs) and spectral cues are accurately estimated.
@inproceedings{richard2021implicit, title={Implicit HRTF Modeling Using Temporal Convolutional Networks}, author={Gebru, Israel D and Markovic, Dejan and Richard, Alexander and Krenn, Steven and Butler, Gladstone and de la Torre, Fernando and Sheikh, Yaser}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, year={2021} }
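As a rough sketch of the model family referred to above — a temporal convolutional network that maps a mono source signal plus a source direction to a two-channel binaural output — here is a hedged PyTorch example. The layer sizes, dilation pattern, and conditioning scheme are assumptions for illustration, not the architecture used in the paper.

```python
# Sketch of a causal temporal ConvNet that binauralizes a mono signal
# conditioned on source direction. Architecture details are illustrative only.
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so no future samples leak in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        y = self.conv(nn.functional.pad(x, (self.pad, 0)))
        return x + torch.tanh(y)                 # residual connection

class ImplicitHRTFNet(nn.Module):
    def __init__(self, channels=64, num_blocks=8):
        super().__init__()
        self.inp = nn.Conv1d(1 + 3, channels, 1)  # mono audio + 3D direction vector
        self.blocks = nn.ModuleList(
            [CausalConvBlock(channels, kernel_size=3, dilation=2 ** i)
             for i in range(num_blocks)]
        )
        self.out = nn.Conv1d(channels, 2, 1)      # left/right ear signals

    def forward(self, mono, direction):
        # mono: (B, 1, T); direction: (B, 3) unit vector toward the source
        cond = direction[:, :, None].expand(-1, -1, mono.shape[-1])
        x = self.inp(torch.cat([mono, cond], dim=1))
        for block in self.blocks:
            x = block(x)
        return self.out(x)

net = ImplicitHRTFNet()
binaural = net(torch.randn(1, 1, 16000), torch.tensor([[0.0, 1.0, 0.0]]))
print(binaural.shape)  # torch.Size([1, 2, 16000])
```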
Audio- and Gaze-driven Facial Animation of Codec Avatars
Alexander Richard*, Colin Lea*, Shugao Ma, Juergen Gall, Fernando de la Torre, Yaser Sheikh
Winter Conference on Applications of Computer Vision (WACV), 2021
Codec Avatars are a recent class of learned, photorealistic face models that accurately represent the geometry and texture of a person in 3D (i.e., for virtual reality), and are almost indistinguishable from video. In this paper we describe the first approach to animate these parametric models in real-time which could be deployed on commodity virtual reality hardware using audio and/or eye tracking. Our goal is to display expressive conversations between individuals that exhibit important social signals such as laughter and excitement solely from latent cues in our lossy input signals. To this end we collected over 5 hours of high frame rate 3D face scans across three participants including traditional neutral speech as well as expressive and conversational speech. We investigate a multimodal fusion approach that dynamically identifies which sensor encoding should animate which parts of the face at any time. See the supplemental video which demonstrates our ability to generate full face motion far beyond the typically neutral lip articulations seen in competing work: https://research.fb.com/videos/audio-and-gaze-driven-facial-animation-of-codec-avatars/
@inproceedings{richard2021audiogaze, title={Audio- and Gaze-driven Facial Animation of Codec Avatars}, author={Alexander Richard and Colin Lea and Shugao Ma and Juergen Gall and Fernando de la Torre and Yaser Sheikh}, booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year={2021}, pages={41-50} }
Temporal Segmentation of Human Actions in Videos
Alexander Richard
PhD Thesis, 2019
Understanding human actions in videos is of great interest in various scenarios ranging from surveillance over quality control in production processes to content-based video search. Algorithms for automatic temporal action segmentation need to overcome severe difficulties in order to be reliable and provide sufficiently good quality. Not only can human actions occur in different scenes and surroundings, the definition of an action itself is also inherently fuzzy, leading to a significant amount of inter-class variations. Moreover, besides finding the correct action label for a pre-defined temporal segment in a video, localizing an action in the first place is anything but trivial. Different actions not only vary in their appearance and duration but also can have long-range temporal dependencies that span over the complete video. Further, getting reliable annotations of large amounts of video data is time-consuming and expensive. The goal of this thesis is to advance current approaches to temporal action segmentation. We therefore propose a generic framework that models the three components of the task explicitly, i.e., long-range temporal dependencies are handled by a context model, variations in segment durations are represented by a length model, and short-term appearance and motion of actions are addressed with a visual model. While the inspiration for the context model mainly comes from word sequence models in natural language processing, the visual model builds upon recent advances in the classification of pre-segmented action clips. Considering that long-range temporal context is crucial, we avoid local segmentation decisions and find the globally optimal temporal segmentation of a video under the explicit models. Throughout the thesis, we provide explicit formulations and training strategies for the proposed generic action segmentation framework under different supervision conditions. First, we address the task of fully supervised temporal action segmentation, where frame-level annotations are available during training. We show that our approach can outperform early sliding window baselines and recent deep architectures and that explicit length and context modeling leads to substantial improvements. Considering that full frame-level annotation is expensive to obtain, we then formulate a weakly supervised training algorithm that uses ordered sequences of actions occurring in the video as the only supervision. While a first approach reduces the weakly supervised setup to a fully supervised setup by generating a pseudo ground-truth during training, we propose a second approach that avoids this intermediate step and allows us to directly optimize a loss based on the weak supervision. Closing the gap between the fully and the weakly supervised setup, we moreover evaluate semi-supervised learning, where video frames are sparsely annotated. With the motivation that the vast amount of video data on the Internet only comes with meta-tags or content keywords that do not provide any temporal ordering information, we finally propose a method for action segmentation that learns from unordered sets of actions only. All approaches are evaluated on several commonly used benchmark datasets. With the proposed methods, we reach state-of-the-art performance for both fully and weakly supervised action segmentation.
@phdthesis{richard2019temporal, title={Temporal Segmentation of Human Actions in Videos}, author={Richard, Alexander}, year={2019}, school={Universit{\"a}ts-und Landesbibliothek Bonn} }
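To make the decoding idea from the thesis concrete, the NumPy sketch below runs segment-level dynamic programming that combines framewise class scores with an explicit length model and a first-order context model to find a globally optimal segmentation. The toy scores and model choices are purely illustrative, not the exact formulation used in the thesis.

```python
# Sketch: globally optimal segmentation of T frames into labeled segments via
# dynamic programming, combining framewise class scores, a length model, and a
# first-order context model. Toy inputs; illustrative only.
import numpy as np

def segment(frame_scores, length_model, context_logprob, max_len):
    """frame_scores: (T, C) framewise log-probabilities per class."""
    T, C = frame_scores.shape
    cum = np.vstack([np.zeros((1, C)), np.cumsum(frame_scores, axis=0)])  # prefix sums

    best = np.full((T + 1, C), -np.inf)   # best[t, c]: score of the best segmentation of
    back = {}                             # frames [0, t) whose last segment has label c
    best[0, :] = 0.0

    for t in range(1, T + 1):
        for c in range(C):
            for length in range(1, min(max_len, t) + 1):
                s = t - length
                visual = cum[t, c] - cum[s, c]           # framewise scores inside the segment
                for prev in range(C):
                    if s == 0 and prev != c:
                        continue                         # no predecessor before the first segment
                    trans = 0.0 if s == 0 else context_logprob[prev, c]
                    score = best[s, prev] + visual + length_model(length) + trans
                    if score > best[t, c]:
                        best[t, c] = score
                        back[(t, c)] = (s, prev)

    # Backtrack the optimal label sequence together with its segment boundaries.
    t, c = T, int(np.argmax(best[T]))
    segments = []
    while t > 0:
        s, prev = back[(t, c)]
        segments.append((s, t, c))                       # (start frame, end frame, class)
        t, c = s, prev
    return segments[::-1]

rng = np.random.default_rng(0)
scores = np.log(rng.dirichlet(np.ones(3), size=40))      # toy framewise posteriors, 40 frames, 3 classes
transitions = np.log(np.full((3, 3), 1.0 / 3.0))         # uniform toy context model

def length_model(length):                                # toy length model favoring ~10-frame segments
    return -0.1 * abs(length - 10)

print(segment(scores, length_model, transitions, max_len=20))
```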
Enhancing Temporal Action Localization with Transfer Learning from Action Recognition
Alexander Richard*, Ahsan Iqbal*, Juergen Gall
Comprehensive Video Understanding in the Wild (CoVieW, ICCV workshop), 2019
Temporal localization of actions in videos has been of increasing interest in recent years. However, most existing approaches rely on complex architectures that are either expensive to train, inefficient at inference time, or require thorough and careful architecture engineering. Classical action recognition on pre-segmented clips, on the other hand, benefits from sophisticated deep architectures that paved the way for highly reliable video clip classifiers. In this paper, we propose to use transfer learning to leverage the good results from action recognition for temporal localization. We apply a network that is inspired by the classical bag-of-words model for transfer learning and show that the resulting framewise class posteriors already provide good results without explicit temporal modeling. Further, we show that combining these features with a deep but simple convolutional network achieves state of the art results on two challenging action localization datasets.
@inproceedings{richard2019enhancing, title={Enhancing Temporal Action Localization with Transfer Learning from Action Recognition}, author={Richard, Alexander and Iqbal, Ahsan and Gall, Juergen}, booktitle={Comprehensive Video Understanding in the Wild}, year={2019} }
A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation
Hilde Kuehne*, Alexander Richard*, Juergen Gall
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019
Action recognition has become a rapidly developing research field within the last decade. But with the increasing demand for large-scale data, the need for hand-annotated training data becomes more and more impractical. One way to avoid frame-based human annotation is the use of action order information to learn the respective action classes. In this context, we propose a hierarchical approach to address the problem of weakly supervised learning of human actions from ordered action labels by structuring recognition in a coarse-to-fine manner. Given a set of videos and an ordered list of the occurring actions, the task is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand-labeled frame boundaries. We address this problem by combining a framewise RNN model with a coarse probabilistic inference. This combination allows for the temporal alignment of long sequences and thus, for an iterative training of both elements. While this system alone already generates good results, we show that the performance can be further improved by approximating the number of subactions to the characteristics of the different action classes as well as by the introduction of a regularizing length prior. The proposed system is evaluated on two benchmark datasets, the Breakfast and the Hollywood extended dataset, showing a competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.
@article{richard2019hybrid, title={A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation}, author={Kuehne, Hilde and Richard, Alexander and Gall, Juergen}, journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, year={2019}, }
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
Alexander Richard, Hilde Kuehne, Ahsan Iqbal, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Video learning is an important task in computer vision and has experienced increasing interest over the recent years. Since even a small amount of videos easily comprises several million frames, methods that do not rely on a frame-level annotation are of special importance. In this work, we propose a novel learning algorithm with a Viterbi-based loss that allows for online and incremental learning of weakly annotated video data. We moreover show that explicit context and length modeling leads to huge improvements in video segmentation and labeling tasks and include these models into our framework. On several action segmentation benchmarks, we obtain an improvement of up to 10% compared to current state-of-the-art methods.
@inproceedings{richard2018nnviterbi, title={NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning}, author={Richard, Alexander and Kuehne, Hilde and Iqbal, Ahsan and Gall, Juergen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, year={2018} }
Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints
Alexander Richard, Hilde Kuehne, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Action detection and temporal segmentation of actions in videos are topics of increasing interest. While fully supervised systems have gained much attention lately, full annotation of each action within the video is costly and impractical for large amounts of video data. Thus, weakly supervised action detection and temporal segmentation methods are of great importance. While most works in this area assume an ordered sequence of occurring actions to be given, our approach only uses a set of actions. Such action sets provide much less supervision since neither action ordering nor the number of action occurrences are known. In exchange, they can be easily obtained, for instance, from meta-tags, while ordered sequences still require human annotation. We introduce a system that automatically learns to temporally segment and label actions in a video, where the only supervision that is used are action sets. An evaluation on three datasets shows that our method still achieves good results although the amount of supervision is significantly smaller than for other related methods.
@inproceedings{richard2018actionsets, title={Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints}, author={Richard, Alexander and Kuehne, Hilde and Gall, Juergen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, year={2018} }
When will you do what? - Anticipating Temporal Occurrences of Activities
Yazan Abu Farha, Alexander Richard, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Analyzing human actions in videos has gained increased attention recently. While most works focus on classifying and labeling observed video frames or anticipating the very recent future, making long-term predictions over more than just a few seconds is a task with many practical applications that has not yet been addressed. In this paper, we propose two methods to predict a considerably large amount of future actions and their durations. Both a CNN and an RNN are trained to learn future video labels based on previously seen content. We show that our methods generate accurate predictions of the future even for long videos with a huge number of different actions and can even deal with noisy or erroneous input information.
@inproceedings{richard2018whenwillyoudowhat, title={When will you do what? - Anticipating Temporal Occurrences of Activities}, author={Abu Farha, Yazan and Richard, Alexander and Gall, Juergen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, year={2018} }
Recurrent Residual Learning for Action Recognition
Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall
German Conference on Pattern Recognition (GCPR), 2017 (Best Master's Award)
Action recognition is a fundamental problem in computer vision with a lot of potential applications such as video surveillance, human computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize actions happening within videos. Historically, hand-crafted video features were used to address the task of action recognition. With the success of Deep ConvNets as an image analysis method, a lot of extensions of standard ConvNets were proposed to process variable length video data. In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition. The approach extends ResNet, a state-of-the-art model for image classification. While the original formulation of ResNet aims at learning spatial residuals in its layers, we extend the approach by introducing recurrent connections that allow learning a spatio-temporal residual. In contrast to fully recurrent networks, our temporal connections only allow a limited range of preceding frames to contribute to the output for the current frame, enabling efficient training and inference as well as limiting the temporal context to a reasonable local range around each frame. On a large-scale action recognition dataset, we show that our model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
@inproceedings{richard2017recurrent, title={Recurrent Residual Learning for Action Recognition}, author={Iqbal, Ahsan and Richard, Alexander and Kuehne, Hilde and Gall, Juergen}, booktitle={German Conference on Pattern Recognition}, year={2017} }
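As a hedged illustration of the idea of a spatio-temporal residual with a limited temporal range, the sketch below uses a causal temporal convolution on top of a per-frame transform, so that only a short window of preceding frames contributes to each output frame. The wiring and sizes are assumptions; this is not the paper's recurrent formulation.

```python
# Sketch: a residual block whose temporal connection only lets a short window of
# preceding frames contribute to each output frame, here realized with a causal
# temporal convolution. Illustrative wiring, not the paper's model.
import torch
import torch.nn as nn

class LimitedRangeResidualBlock(nn.Module):
    def __init__(self, channels, temporal_window=4):
        super().__init__()
        self.spatial = nn.Linear(channels, channels)       # per-frame transform
        self.temporal = nn.Conv1d(channels, channels,
                                  kernel_size=temporal_window, bias=False)
        self.window = temporal_window

    def forward(self, x):
        # x: (batch, time, channels) per-frame video features
        h = torch.relu(self.spatial(x))
        # Causal temporal convolution: left-pad so frame t only sees frames t-w+1 .. t.
        h = nn.functional.pad(h.transpose(1, 2), (self.window - 1, 0))
        h = self.temporal(h).transpose(1, 2)
        return x + h                                        # spatio-temporal residual

block = LimitedRangeResidualBlock(channels=128)
frames = torch.randn(2, 32, 128)                            # 2 clips, 32 frames each
print(block(frames).shape)                                  # torch.Size([2, 32, 128])
```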
Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling
Alexander Richard, Hilde Kuehne, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 (Oral)
We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to allow for a temporal alignment and inference over long sequences. While this system alone already generates good results, we show that the performance can be further improved by approximating the number of subactions to the characteristics of the different action classes. To this end, we adapt the number of subaction classes by iterating realignment and reestimation during training. The proposed system is evaluated on two benchmark datasets, the Breakfast and the Hollywood extended dataset, showing a competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.
@inproceedings{richard2017weakly, title={Weakly supervised action learning with RNN based fine-to-coarse modeling}, author={Richard, Alexander and Kuehne, Hilde and Gall, Juergen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, year={2017} }
Weakly Supervised Learning of Actions from Transcripts
Hilde Kuehne, Alexander Richard, Juergen Gall
Computer Vision and Image Understanding (CVIU), 2017
We present an approach for weakly supervised learning of human actions from video transcriptions. Our system is based on the idea that, given a sequence of input data and a transcript, i.e. a list of the order the actions occur in the video, it is possible to infer the actions within the video stream, and thus, learn the related action models without the need for any frame-based annotation. Starting from the transcript information at hand, we split the given data sequences uniformly based on the number of expected actions. We then learn action models for each class by maximizing the probability that the training video sequences are generated by the action models given the sequence order as defined by the transcripts. The learned model can be used to temporally segment an unseen video with or without transcript. We evaluate our approach on four distinct activity datasets, namely Hollywood Extended, MPII Cooking, Breakfast and CRIM13. We show that our system is able to align the scripted actions with the video data and that the learned models localize and classify actions competitively in comparison to models trained with full supervision, i.e. with frame level annotations, and that they outperform any current state-of-the-art approach for aligning transcripts with video data.
@article{richard2017weaklysupervised, title={Weakly supervised learning of actions from transcripts}, author={Kuehne, Hilde and Richard, Alexander and Gall, Juergen}, journal={Computer Vision and Image Understanding}, year={2017}, publisher={Elsevier} }
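The uniform initialization mentioned in the abstract — splitting a video evenly according to its transcript before any realignment — is simple enough to sketch directly; the function and label names below are made up for illustration.

```python
# Sketch: uniform initial alignment of a transcript to video frames. Each of the
# N transcribed actions is assigned an equally sized chunk of the T frames.
import numpy as np

def uniform_alignment(transcript, num_frames):
    boundaries = np.linspace(0, num_frames, num=len(transcript) + 1).astype(int)
    labels = np.empty(num_frames, dtype=object)
    for action, start, end in zip(transcript, boundaries[:-1], boundaries[1:]):
        labels[start:end] = action
    return labels

# Hypothetical transcript of a 10-frame toy video.
print(uniform_alignment(["take_cup", "pour_coffee", "stir"], 10))
```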
A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition
Alexander Richard, Juergen Gall
Computer Vision and Image Understanding (CVIU), 2017
The traditional bag-of-words approach has found a wide range of applications in computer vision. The standard pipeline consists of a generation of a visual vocabulary, a quantization of the features into histograms of visual words, and a classification step for which usually a support vector machine in combination with a non-linear kernel is used. Given large amounts of data, however, the model suffers from a lack of discriminative power. This applies particularly for action recognition, where the vast amount of video features needs to be subsampled for unsupervised visual vocabulary generation. Moreover, the kernel computation can be very expensive on large datasets. In this work, we propose a recurrent neural network that is equivalent to the traditional bag-of-words approach but enables the application of discriminative training. The model further allows the kernel computation to be incorporated directly into the neural network, solving the complexity issue and allowing the complete classification system to be represented within a single network. We evaluate our method on four recent action recognition benchmarks and show that the conventional model as well as sparse coding methods are outperformed.
@article{richard2017bag, title={A bag-of-words equivalent recurrent neural network for action recognition}, author={Richard, Alexander and Gall, Juergen}, journal={Computer Vision and Image Understanding}, volume={156}, pages={79--91}, year={2017}, publisher={Elsevier} }
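The central construction — expressing the bag-of-words pipeline as differentiable network operations so that the visual vocabulary and the classifier can be trained jointly — can be sketched as follows. The soft-assignment variant and all sizes here are illustrative assumptions rather than the paper's exact equivalent formulation.

```python
# Sketch: a bag-of-words encoder written as differentiable operations, so the
# codebook (visual vocabulary) and the classifier are trained jointly.
# Soft assignment via a softmax over negative distances; illustrative only.
import torch
import torch.nn as nn

class NeuralBoW(nn.Module):
    def __init__(self, feat_dim, vocab_size, num_classes, temperature=1.0):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(vocab_size, feat_dim))
        self.classifier = nn.Linear(vocab_size, num_classes)
        self.temperature = temperature

    def forward(self, features):
        # features: (batch, num_descriptors, feat_dim) local video descriptors
        codebook = self.codebook.unsqueeze(0).expand(features.size(0), -1, -1).contiguous()
        dist = torch.cdist(features, codebook)                 # (B, N, vocab_size)
        assign = torch.softmax(-dist / self.temperature, -1)   # soft visual-word assignment
        histogram = assign.mean(dim=1)                         # normalized BoW histogram
        return self.classifier(histogram)

model = NeuralBoW(feat_dim=64, vocab_size=256, num_classes=10)
logits = model(torch.randn(8, 500, 64))                        # 8 videos, 500 descriptors each
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss.backward()                                                # gradients also reach the codebook
print(logits.shape)
```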
Temporal Action Detection using a Statistical Language Model
Alexander Richard, Juergen Gall
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
While current approaches to action recognition on pre-segmented video clips already achieve high accuracies, temporal action detection is still far from comparably good results. Automatically locating and classifying the relevant action segments in videos of varying lengths proves to be a challenging task. We propose a novel method for temporal action detection including statistical length and language modeling to represent temporal and contextual structure. Our approach aims at globally optimizing the joint probability of three components, a length and language model and a discriminative action model, without making intermediate decisions. The problem of finding the most likely action sequence and the corresponding segment boundaries in an exponentially large search space is addressed by dynamic programming. We provide an extensive evaluation of each model component on Thumos 14, a large action detection dataset, and report state-of-the-art results on three datasets.
@inproceedings{richard2016temporal, title={Temporal action detection using a statistical language model}, author={Richard, Alexander and Gall, Juergen}, booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, pages={3131--3140}, year={2016} }
A BoW-equivalent Recurrent Neural Network for Action Recognition
Alexander Richard, Juergen Gall
British Machine Vision Conference (BMVC), 2015
Bag-of-words (BoW) models are widely used in the field of computer vision. A BoW model consists of a visual vocabulary that is generated by unsupervised clustering of the features of the training data, e.g., using k-means. The clustering methods, however, struggle with large amounts of data, in particular, in the context of action recognition. In this paper, we propose a transformation of the standard BoW model into a neural network, enabling discriminative training of the visual vocabulary on large action recognition datasets. We show that our model is equivalent to the original BoW model but allows for the application of supervised neural network training. Our model outperforms the conventional BoW model and sparse coding methods on recent action recognition benchmarks.
@inproceedings{richard2015bow, title={A BoW-equivalent Recurrent Neural Network for Action Recognition.}, author={Richard, Alexander and Gall, Juergen}, booktitle={British Machine Vision Conference}, volume={6}, year={2015} }
Mean-normalized Stochastic Gradient for Large-scale Deep Learning
Simon Wiesler, Alexander Richard, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy.
@inproceedings{richard2014mnsgd, title={Mean-normalized stochastic gradient for large-scale deep learning}, author={Wiesler, Simon and Richard, Alexander and Schl{\"u}ter, Ralf and Ney, Hermann}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, pages={180--184}, year={2014} }
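A small toy diagnostic of the observation that motivates this method — a non-zero feature mean inflates the curvature of the objective along one direction and thereby caps the stable step size — is sketched below. It only illustrates the motivation and is not the optimization algorithm proposed in the paper.

```python
# Toy illustration: a non-zero feature mean adds one very high-curvature direction
# to a least-squares objective, which limits the stable (S)GD step size; removing
# the mean removes that direction. Motivation only, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(2000, 20))      # features with mean ~5

def largest_curvature(features):
    second_moment = features.T @ features / len(features)
    return np.linalg.eigvalsh(second_moment)[-1]          # largest eigenvalue

print("raw features:             %.1f" % largest_curvature(X))
print("mean-subtracted features: %.1f" % largest_curvature(X - X.mean(axis=0)))
# For quadratic objectives the stable gradient step size scales like
# 2 / largest_curvature, so subtracting the mean permits much larger steps.
```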
RASR/NN: The RWTH Neural Network Toolkit for Speech Recognition
Simon Wiesler, Alexander Richard, Pavel Golik, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper describes the new release of RASR - the open source version of the well-proven speech recognition toolkit developed and used at RWTH Aachen University. The focus is put on the implementation of the NN module for training neural network acoustic models. We describe code design, configuration, and features of the NN module. The key feature is a high flexibility regarding the network topology, choice of activation functions, training criteria, and optimization algorithm, as well as a built-in support for efficient GPU computing. The evaluation of run-time performance and recognition accuracy is performed exemplary with a deep neural network as acoustic model in a hybrid NN/HMM system. The results show that RASR achieves a state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems.
@inproceedings{richard2014rasr, title={RASR/NN: The RWTH neural network toolkit for speech recognition}, author={Wiesler, Simon and Richard, Alexander and Golik, Pavel and Schl{\"u}ter, Ralf and Ney, Hermann}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, pages={3281--3285}, year={2014} }
A Critical Evaluation of Stochastic Algorithms for Convex Optimization
Simon Wiesler, Alexander Richard, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013
Log-linear models find a wide range of applications in pattern recognition. The training of log-linear models is a convex optimization problem. In this work, we compare the performance of stochastic and batch optimization algorithms. Stochastic algorithms are fast on large data sets but can not be parallelized well. In our experiments on a broadcast conversations recognition task, stochastic methods yield competitive results after only a short training period, but when spending enough computational resources for parallelization, batch algorithms are competitive with stochastic algorithms. We obtained slight improvements by using a stochastic second order algorithm. Our best log-linear model outperforms the maximum likelihood trained Gaussian mixture model baseline although being ten times smaller.
@inproceedings{richard2013critical, title={A critical evaluation of stochastic algorithms for convex optimization}, author={Wiesler, Simon and Richard, Alexander and Schl{\"u}ter, Ralf and Ney, Hermann}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, pages={6955--6959}, year={2013} }
Feature Selection for Log-linear Acoustic Models
Simon Wiesler, Alexander Richard, Yotaro Kubo, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Log-linear acoustic models have been shown to be competitive with Gaussian mixture models in speech recognition. Their high training time can be reduced by feature selection. We compare a simple univariate feature selection algorithm with ReliefF - an efficient multivariate algorithm. An alternative to feature selection is l1-regularized training, which leads to sparse models. We observe that this gives no speedup when sparse features are used, hence feature selection methods are preferable. For dense features, l1-regularization can reduce training and recognition time. We generalize the well known Rprop algorithm for the optimization of l1-regularized functions. Experiments on the Wall Street Journal corpus showed that a large number of sparse features could be discarded without loss of performance. A strong regularization led to slight performance degradations, but can be useful on large tasks, where training the full model is not tractable.
@inproceedings{richard2011feature, title={Feature selection for log-linear acoustic models}, author={Wiesler, Simon and Richard, Alexander and Kubo, Yotaro and Schl{\"u}ter, Ralf and Ney, Hermann}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing}, pages={5324--5327}, year={2011} }
Education/Work

Oct 2019 - current

Research Scientist at Meta Reality Labs (formerly Facebook Reality Labs), Pittsburgh

Multimodal modeling for Social VR.

Mar 2019 - Sep 2019

Scientist at Amazon Alexa, Aachen

Working on Alexa at Amazon in Aachen, Germany.

Mar 2018 - Aug 2018

Research Intern at Facebook Reality Labs

Research internship at Facebook Reality Labs in Pittsburgh, working on multi-modal modeling for social VR.

Mar 2014 - Mar 2019

PhD Student, University of Bonn

Researcher in the field of video analytics and action recognition.
Research focus on temporal action segmentation and action labeling for long, untrimmed videos containing multiple action instances.

Oct 2011 - Feb 2014

Master's Degree, RWTH Aachen University

Thesis: Improved Optimization of Neural Networks
Implementation of a neural network module for the speech recognition software RASR; design, analysis, and evaluation of a novel optimization algorithm called mean-normalized stochastic gradient descent (MN-SGD).

Oct 2008 - Sep 2011

Bachelor's Degree, RWTH Aachen University

Thesis: Optimization of Log-linear Models and Online Learning
Analytical and empirical evaluation of various optimization methods for log-linear models.