The artificial intelligence (AI) field has spent the past few months managing a burst of generative models. The open-source release of Stable Diffusion was the spark that ignited the ongoing wave of generative text-to-image models.
Despite the great success of image generation models, generating temporally consistent long videos remains a challenge. Phenaki has been the most successful text-to-video example, and even it fails to maintain consistency in certain scenarios.
Video prediction models have come a long way in recent years thanks to advances in neural networks and GPUs. They can produce complex frames that stay close to the source video, and we now have models that can generate short clips conditioned on the previous frames.
Unfortunately, the same does not apply to long videos. Existing prediction models can be applied to long videos by sliding a short context window over the frames, and the result may look impressive at first glance. However, these videos lack temporal consistency because the window is not long enough to capture long-term dependencies between frames.
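The sliding-window approach described above can be sketched in a few lines. This is a toy illustration, not code from any real model: `predict_next` is a hypothetical stand-in for a short-context frame predictor, and the point is simply that frames older than the window can never influence the next prediction.

```python
def predict_next(context):
    # Toy stand-in for a learned frame predictor: a real model would
    # generate a new frame from the context; here we echo the last
    # "frame" (an integer) plus one so the rollout is easy to follow.
    return context[-1] + 1

def rollout(initial_frames, num_new_frames, window=4):
    """Autoregressively generate frames, conditioning only on the last
    `window` frames -- anything older falls out of the context, which is
    exactly where long-term consistency is lost."""
    frames = list(initial_frames)
    for _ in range(num_new_frames):
        context = frames[-window:]  # short sliding window
        frames.append(predict_next(context))
    return frames

print(rollout([0, 1, 2, 3], 3))  # → [0, 1, 2, 3, 4, 5, 6]
```

No matter how good `predict_next` is, a scene element that left the window several frames ago cannot be restored consistently when it should reappear.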
Temporal consistency matters for an engaging visual experience. Imagine looking at a predicted scene, zooming in on one part of it, and finding the scene completely changed when we zoom back out because the model is not temporally consistent. That would be frustrating to watch.
The other important aspect is the predictive model's imagination. When the scene changes, we want to see a new setup rather than the same objects everywhere. The ideal video prediction model, then, is consistent over time and imaginative about new scenes. But how close can we get to this ideal? Time to meet TECO.
TECO is a vector-quantized latent dynamics model that can effectively model long-term dependencies by applying efficient transformers in a compact representation space. It shows strong performance on a variety of difficult video prediction tasks, made possible by its ability to capture long-term temporal dependencies in the video.
TECO combines an efficient frame representation with a careful use of transformers to maintain temporal consistency between frames. The compact representation vectors significantly reduce computational and storage requirements.
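To give an intuition for what "vector-quantized" means here, the sketch below maps each frame embedding to the index of its nearest codebook vector, so a frame is stored as a single small integer instead of a full vector. This is a generic illustration of vector quantization, not TECO's actual implementation; the codebook and embeddings are made-up toy values.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two vectors (as plain lists).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vec, codebook):
    """Return the index of the nearest codebook entry to `vec`."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

# Toy codebook of 4 entries and three fake 2-D frame embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_embeddings = [[0.1, 0.2], [0.9, 0.1], [0.4, 0.9]]

codes = [quantize(v, codebook) for v in frame_embeddings]
print(codes)  # → [0, 1, 2]
```

Each frame is now represented by one codebook index, which is what makes running transformers over long sequences of frames tractable in a compact latent space.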
It starts with a GAN (Generative Adversarial Network) model trained to spatially compress the video data. This approach has been used before in the literature and has been shown to improve the efficiency of video prediction models. However, even after moving the video into latent space, previous methods were limited to modeling short sequences because of the extremely high cost of Transformer layers. TECO sidesteps this problem so that transformers can be applied to longer video sequences, maintaining temporal consistency. A custom loss function called DropLoss is also used to train the model efficiently.
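A rough sketch of a DropLoss-style objective, assuming it works by decoding and scoring only a random subset of timesteps so the expensive decoder is skipped for the rest. The function below is illustrative only: `decode_and_loss` and `keep_frac` are hypothetical names, not TECO's API.

```python
import random

def droploss(frames, decode_and_loss, keep_frac=0.5, rng=None):
    """Sketch of a DropLoss-style objective: sample a random subset of
    timesteps, run the costly decode-and-score step only on those, and
    average. The remaining timesteps skip decoding entirely, which is
    where the training-cost savings would come from."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility here
    n = len(frames)
    kept = rng.sample(range(n), max(1, int(n * keep_frac)))
    return sum(decode_and_loss(frames[i]) for i in kept) / len(kept)

# Toy usage: a dummy per-frame loss that is constant, so the subsampled
# average equals the full average regardless of which frames are kept.
print(droploss([0, 1, 2, 3], lambda f: 1.0))  # → 1.0
```

Because the subset is resampled each step, every timestep still contributes to training in expectation, while each individual step only pays for a fraction of the decoding work.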
To demonstrate TECO's strengths, the authors introduced three challenging video datasets, built on existing benchmarks, to better measure long-term consistency. In the experiments, TECO showed strong temporal consistency while generating high-quality frames.
This was a short summary of TECO. Check out the links below if you want to learn more about it.
Check out the paper, project, and code. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and works as a researcher in the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.