Latest Artificial Intelligence (AI) Research From NVIDIA Shows How To Animate Portraits Using Speech And A Single Image

Artificial intelligence (AI) has been a topic of increasing importance in recent years. Technological advances have made it possible to solve problems that were previously considered unsolvable. As a result, AI is increasingly being used to automate decision-making in a variety of fields. One such task is portrait animation, in which realistic animations are automatically generated from a single portrait image.

Given the complexity of the task, animating a portrait is an open problem in computer vision. More recent works use speech cues to drive the animation process. These approaches attempt to learn how the input speech can be mapped to facial representations. An ideally generated video should have good lip sync with the audio, natural facial expressions and head movements, and high image quality.

Cutting-edge techniques in this field rely on end-to-end deep neural network architectures consisting of pre-processing networks that convert the input audio sequence into usable tokens and learned emotion embeddings that map these tokens to the appropriate poses. Some works focus on animating the 3D vertices of a face model. However, these methods require special training data, e.g., 3D face models, which may not be available for many applications. Other approaches work with 2D faces and generate realistic lip movements according to the input audio signals. Despite the lip movement, their results lack realism when used with a single input image because the rest of the face remains stationary.

The aim of the presented method, called SPACEx, is to use 2D frames in a clever way to overcome the limitations of these state-of-the-art approaches while still achieving realistic results.

The architecture of the proposed method is shown in the figure below.

SPACEx takes an input speech clip and a facial image (with an optional emotion label) and creates an output video. It combines the advantages of related work through a three-stage prediction framework.

First, normalized facial landmarks are extracted from the input image (Speech2Landmarks in the figure above). The network then predicts per-frame motions of these landmarks based on the input speech and emotion label. The input speech is not fed directly to the landmark predictor: 40 Mel-Frequency Cepstral Coefficients (MFCCs) are first extracted from it using a 1024-sample Fast Fourier Transform (FFT) window at 30 fps, so that the audio features are aligned with the video frames.
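As a purely illustrative sketch of this audio pre-processing step (not the authors' code), the snippet below extracts 40 MFCCs with a 1024-sample FFT window and a hop length that yields one feature frame per video frame at 30 fps. The librosa-based function and the 16 kHz sampling rate are assumptions made for the example.

```python
import librosa
import numpy as np

# Illustrative audio pre-processing (assumptions, not the authors' code):
# 40 MFCCs per frame, a 1024-sample FFT window, and a hop length chosen
# so that one feature frame corresponds to one video frame at 30 fps.
SAMPLE_RATE = 16000              # assumed audio sampling rate
FPS = 30                         # target video frame rate
N_MFCC = 40                      # MFCC coefficients per frame
N_FFT = 1024                     # FFT window size in samples
HOP_LENGTH = SAMPLE_RATE // FPS  # ~533 samples -> ~30 feature frames/second

def speech_to_mfcc(wav_path: str) -> np.ndarray:
    """Load a speech clip and return MFCC features aligned to 30 fps video.

    Output shape: (num_frames, 40).
    """
    audio, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    mfcc = librosa.feature.mfcc(
        y=audio,
        sr=SAMPLE_RATE,
        n_mfcc=N_MFCC,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
    )
    return mfcc.T  # one row of 40 coefficients per video frame
```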

Second, the per-frame posed facial landmarks are converted into latent keypoints (Landmarks2Latents in the figure above).

Finally, given the input image and the per-frame latent keypoints predicted in the previous step, face-vid2vid, a pre-trained image-based facial animation model, outputs an animated video with frames at 512 × 512 px.
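To make the three-stage decomposition concrete, here is a hedged, high-level sketch of how inference could be wired together. The module names follow the figure (Speech2Landmarks, Landmarks2Latents, face-vid2vid), but the function signatures, tensor shapes, and the loop itself are illustrative assumptions, not the released implementation.

```python
import torch

def animate_portrait(
    speech2landmarks: torch.nn.Module,   # stage 1: audio features -> per-frame landmarks
    landmarks2latents: torch.nn.Module,  # stage 2: posed landmarks -> latent keypoints
    face_vid2vid: torch.nn.Module,       # stage 3: pre-trained image-based generator
    mfcc: torch.Tensor,                  # (T, 40) audio features at 30 fps
    source_image: torch.Tensor,          # (3, 512, 512) input portrait
    emotion_label: torch.Tensor,         # optional emotion conditioning
) -> torch.Tensor:
    """Hypothetical inference loop for the three-stage pipeline.

    Shapes and module interfaces are illustrative assumptions; the actual
    interfaces are defined by the paper and project code.
    """
    # Stage 1: predict per-frame facial landmark motion from speech.
    landmarks = speech2landmarks(mfcc, source_image, emotion_label)   # (T, K, 2)

    # Stage 2: convert per-frame posed landmarks into the latent keypoints
    # understood by the pre-trained generator.
    latents = landmarks2latents(landmarks, emotion_label)             # (T, D)

    # Stage 3: the pre-trained face-vid2vid generator renders each frame
    # from the single source image and the per-frame latent keypoints.
    frames = [face_vid2vid(source_image, z) for z in latents]         # (3, 512, 512) each
    return torch.stack(frames)                                        # (T, 3, 512, 512)
```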

The proposed decomposition has several advantages. First, it allows fine-grained control over the output facial expressions (such as winks or specific head poses). In addition, the latent keypoints can be modulated with emotion labels to change the intensity of an expression or to control the gaze direction. Finally, using a pre-trained face generator significantly reduces training costs.
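To picture what modulating the latent keypoints could look like in practice, the sketch below scales the offset of the predicted latent keypoints from a neutral reference to damp or exaggerate an expression. This is only a hypothetical control knob for illustration, not the paper's actual modulation scheme.

```python
import torch

def modulate_expression(latents: torch.Tensor,
                        neutral_latents: torch.Tensor,
                        intensity: float) -> torch.Tensor:
    """Hypothetical expression-intensity control (not the paper's method):
    scale the offset of the predicted latent keypoints from a neutral
    reference, so intensity < 1 damps and intensity > 1 exaggerates
    the expression.
    """
    return neutral_latents + intensity * (latents - neutral_latents)
```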

Moving on to the experimental part, SPACEx was trained on three different datasets (VoxCeleb2, RAVDESS, and MEAD) and compared to previous work on speech-driven animation. The metrics used for the comparison are (i) lip-sync quality, (ii) landmark accuracy, (iii) photorealism (FID score), and (iv) human ratings.
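As a rough illustration of a landmark-accuracy metric of this kind, the snippet below averages the Euclidean distance between predicted and ground-truth 2D landmarks and normalizes it by the inter-ocular distance. The 68-point landmark convention and this particular normalization are assumptions for the example; the paper's exact metric may differ.

```python
import numpy as np

def normalized_landmark_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Illustrative landmark-accuracy metric (assumed normalization).

    pred, gt: arrays of shape (T, K, 2) -- T frames, K landmarks, (x, y).
    Assumes the 68-point convention, where indices 36 and 45 are the
    outer eye corners used for scale normalization.
    """
    # per-frame, per-landmark Euclidean distances
    dists = np.linalg.norm(pred - gt, axis=-1)                      # (T, K)
    # per-frame inter-ocular distance for scale normalization
    inter_ocular = np.linalg.norm(gt[:, 36] - gt[:, 45], axis=-1)   # (T,)
    return float(np.mean(dists / inter_ocular[:, None]))
```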

According to the results reported in the paper, SPACEx achieves the lowest FID and normalized landmark distance compared to the other approaches. These results indicate that SPACEx produces the best image quality and achieves the highest accuracy in landmark estimation. Some of the results are reported below.

Unlike SPACEx, previous methods suffer from quality degradation or fail with arbitrary poses. In addition, SPACEx is able to generate missing details such as teeth, while other methods either fail or introduce artifacts.

This was a summary of SPACEx, a novel end-to-end speech-driven method for animating portraits. Check out the links below if you want to learn more about it.


Check out the paper and project page. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.


Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.