Image and video editing are two of the most popular applications for computer users. With the advent of Machine Learning (ML) and Deep Learning (DL), image and video manipulation has been progressively explored through multiple neural network architectures. Until recently, most DL models for image and video processing were supervised and specifically required training data containing pairs of input and output examples from which to learn the details of the desired transformation. More recently, end-to-end learning frameworks have been proposed that require only a single input image to learn a mapping to the desired edited output.
Video matting is a special task within video editing. The term 'matting' dates back to the 19th century, when glass plates with matte paint were placed in front of a camera during filming to create the illusion of an environment that was not present on location. Today, the composition of multiple digital images follows a similar process: a compositing equation blends the foreground and background intensities of each pixel, expressed as a linear combination of the two components.
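As a concrete illustration (a minimal sketch with random stand-in images, not the paper's code), the per-pixel compositing equation C = αF + (1 − α)B can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 4x4 RGB frames; in practice these are real foreground/background plates.
foreground = rng.random((4, 4, 3))
background = rng.random((4, 4, 3))
alpha = rng.random((4, 4, 1))  # per-pixel opacity in [0, 1]

# Classic compositing: each pixel is a linear blend of the two components,
# weighted by the alpha matte.
composite = alpha * foreground + (1.0 - alpha) * background
print(composite.shape)  # (4, 4, 3)
```

Matting is the inverse problem: given only the composite, recover the foreground, background, and alpha layers.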
Although powerful, this process has limitations. It requires a factorization of the image into foreground and background layers, which are then assumed to be independently editable. In settings such as video matting, i.e. a sequence of temporally and spatially correlated frames, this layered decomposition becomes a complex task.
The goals of this work are to elucidate this process and to increase decomposition accuracy. The authors propose factor matting, a variant of the matting problem that decomposes a video into more independent components for downstream editing tasks. To address this problem, they present FactorMatte, an easy-to-use framework that combines classical matting priors with conditional priors based on the deformations expected in a scene. The classical Bayesian formulation, which casts matting as maximum a posteriori (MAP) estimation, is thus extended to remove the restrictive assumption that foreground and background are independent. Most approaches also assume that background layers remain static over time, which is severely limiting for most video sequences.
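Schematically (notation ours, not taken from the paper), the classical Bayesian matting objective reads:

```latex
\hat{F}, \hat{B}, \hat{\alpha}
  = \arg\max_{F, B, \alpha} \; p(F, B, \alpha \mid C)
  \propto \arg\max_{F, B, \alpha} \; p(C \mid F, B, \alpha)\, p(F)\, p(B)\, p(\alpha),
```

where the factorization p(F, B) = p(F) p(B) encodes exactly the independence assumption that FactorMatte relaxes.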
To overcome these limitations, FactorMatte relies on two engines: a decomposition network that splits the input video into one or more layers per component, and a set of patch-based discriminators that represent the conditional priors for each component. The architecture pipeline is shown below.
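To give an intuition for the patch-based discriminators (a PatchGAN-style sketch with made-up weights; the function name and the simple linear scorer are illustrative, not FactorMatte's actual networks), the idea is to output one real/fake score per local patch rather than a single score per frame:

```python
import numpy as np

def patch_scores(image, patch=4, weights=None, rng=None):
    """Score each non-overlapping patch independently: instead of one
    real/fake probability for the whole frame, produce one per patch."""
    h_img, w_img, c = image.shape
    rng = np.random.default_rng(0) if rng is None else rng
    if weights is None:
        weights = rng.standard_normal(patch * patch * c)  # stand-in for learned weights
    scores = np.empty((h_img // patch, w_img // patch))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            p = image[i*patch:(i+1)*patch, j*patch:(j+1)*patch].ravel()
            scores[i, j] = 1.0 / (1.0 + np.exp(-p @ weights))  # sigmoid "real" probability
    return scores

layer = np.random.default_rng(1).random((16, 16, 3))
print(patch_scores(layer).shape)  # (4, 4)
```

Because each patch is scored independently, such a discriminator can express a local prior on what a plausible layer looks like, which is what makes it suitable as a learned conditional prior per component.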
The input to the decomposition network consists of a video and a coarse, frame-by-frame segmentation mask for the object of interest (left, yellow box). From this information, the network generates color and alpha layers (middle, green and blue boxes) based on a reconstruction loss. The foreground layer models the foreground component (right, green box), while the environment layer and the residual layer together model the background component (right, blue box). The environment layer represents the static aspects of the background, while the residual layer captures more irregular changes in the background caused by interactions with the foreground objects (the pincushion in the figure). A discriminator is trained for each of these layers to learn the respective marginal priors.
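A rough sketch of how such layers could be recomposited back-to-front with the standard 'over' operator (the layer names follow the article, but the random values and the helper `over` are illustrative, not the authors' code):

```python
import numpy as np

def over(rgb_a, alpha_a, rgb_b, alpha_b):
    """Porter-Duff 'over': composite layer A on top of layer B (straight alpha)."""
    out_alpha = alpha_a + alpha_b * (1.0 - alpha_a)
    out_rgb = (rgb_a * alpha_a + rgb_b * alpha_b * (1.0 - alpha_a)) / np.maximum(out_alpha, 1e-8)
    return out_rgb, out_alpha

rng = np.random.default_rng(0)
h, w = 8, 8
# Illustrative layers: an opaque static environment, a residual layer for
# background changes, and a foreground object layer.
env_rgb, env_a = rng.random((h, w, 3)), np.ones((h, w, 1))
res_rgb, res_a = rng.random((h, w, 3)), rng.random((h, w, 1))
fg_rgb, fg_a = rng.random((h, w, 3)), rng.random((h, w, 1))

# Composite back-to-front: environment, then residual, then foreground.
rgb, a = over(res_rgb, res_a, env_rgb, env_a)
rgb, a = over(fg_rgb, fg_a, rgb, a)
```

Because the environment layer is opaque, the final composite is fully opaque as well; the residual and foreground layers only modulate where their alpha is non-zero.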
The matting results for some selected samples are shown in the figure below.
Although FactorMatte is not perfect, the results achieved are significantly more accurate than those of the baseline approach (OmniMatte). In all of the presented examples, the background and foreground layers show a clean separation, which cannot be said of the comparison solution. In addition, ablation studies were performed to demonstrate the effectiveness of the proposed solution.
This was a summary of FactorMatte, a novel framework to address the video matting problem. If you are interested, see the links below for more information.
Check out the Paper, Code, and Project. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. Candidate at the Institute for Computer Science (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning and QoS/QoE evaluation.