Building a web-based real-time video editing tool with machine learning

Anastasis Germanidis

November 17, 2020

The product and the challenges

Green Screen works as follows: after uploading a video containing a shot of the object you’d like to mask, you begin the masking process by clicking on the object in one frame of the video, which generates an initial mask. To refine this mask, you can add “include” clicks to incorporate additional regions of the frame into the mask, or “exclude” clicks to remove erroneously masked regions. Once you are satisfied with the mask generated for that initial frame, you can “apply” the mask to the rest of the video to generate a continuous mask stream that follows the object as it moves through the shot. If needed, you can add more clicks in other parts of the video to fix any mistakes you see, repeating this process until you are satisfied with the results for the entire video. Finally, you can export a video of the selected object on a chroma key background, which you can then import into other video editing software for further processing.

We solved a variety of interesting technical problems across our stack while developing this feature: (1) generating a high-quality initial mask from only a few clicks on a single frame, (2) propagating that initial mask to the rest of the video in a temporally consistent manner, (3) generating the resulting mask stream on our backend and sending it to the client quickly enough to allow the user to preview the results within seconds, and (4) efficiently and correctly rendering the resulting mask on the frontend so the user can spot any errors and add more clicks to fix them. Let’s walk through these problems one by one.

Refining a mask with “include” and “exclude” clicks.

Generating an initial mask from clicks

We start with the challenge of generating a high-quality initial mask for an object with as little effort from the user as possible. To enable our user-guided object selection workflow, we trained an interactive segmentation model based on the U-Net architecture, which we call the Refinement Network. The model learns to generate masks for arbitrary objects from a handful of user clicks. Note that this differs from the more common semantic segmentation task, where the objective is to segment objects from a predefined set of categories (person, car, etc.) without any user guidance. To train the Refinement Network, we generated a synthetic paired dataset of user clicks and resulting masks, using a probabilistic model to simulate how a user would behave while refining a mask in the Green Screen interface.
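To make the idea of simulated clicks concrete, here is a minimal sketch. It is not the probabilistic model described above, whose details this post does not cover. It assumes a simulated user who always clicks inside a region where the current mask is wrong, with uniform sampling over the erroneous pixels; the function name `simulate_click` and the list-of-lists mask format are illustrative choices.

```python
import random


def simulate_click(target, predicted, rng):
    """Sample one corrective click given ground-truth and current masks.

    Both masks are 2D lists of 0/1 values with the same shape.
    Returns (row, col, kind), where kind is "include" for an object pixel
    the current mask misses and "exclude" for a pixel it wrongly covers.
    Returns None when the two masks already agree everywhere.

    Assumption: errors are sampled uniformly; a realistic user model
    would likely favor large, visible error regions.
    """
    errors = []
    for r, (t_row, p_row) in enumerate(zip(target, predicted)):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            if t and not p:
                errors.append((r, c, "include"))
            elif p and not t:
                errors.append((r, c, "exclude"))
    return rng.choice(errors) if errors else None


# One training example pairs the sampled click with the target mask.
target = [[0, 1], [0, 0]]
predicted = [[0, 0], [1, 0]]
click = simulate_click(target, predicted, random.Random(0))
```

Repeating this step, with the model's latest prediction standing in for `predicted`, would yield click sequences of varying length for a single object.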

Propagating masks through time, in real-time

Once the user has finished refining the mask on a particular frame, they can “apply” their changes to the rest of the video to generate a temporally consistent mask stream. For this task, we employed a different neural network, which we call the Propagation Network. The network learns to generate a continuous series of masks for each frame, tracking the object as it moves through the video, by finding correspondences with the “keyframe masks” that the user has created using the Refinement Network.
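As a rough illustration of what “finding correspondences” means, the sketch below labels each pixel of a new frame with the mask value of its most similar keyframe pixel. This is a nearest-neighbor stand-in, not the Propagation Network itself: the feature vectors are assumed to come from some learned encoder, and the function name `propagate_mask` is hypothetical.

```python
def propagate_mask(key_features, key_mask, frame_features):
    """Label each pixel of a new frame by its best-matching keyframe pixel.

    key_features:   list of per-pixel feature vectors from the keyframe(s).
    key_mask:       list of 0/1 labels aligned with key_features.
    frame_features: list of per-pixel feature vectors from the new frame.

    Similarity is a plain dot product (an assumption for this sketch).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    result = []
    for feat in frame_features:
        # Index of the keyframe pixel whose features match this pixel best.
        best = max(range(len(key_features)),
                   key=lambda i: dot(feat, key_features[i]))
        result.append(key_mask[best])
    return result


# The object (first keyframe pixel) shows up as the second pixel here.
mask = propagate_mask(
    key_features=[[1.0, 0.0], [0.0, 1.0]],
    key_mask=[1, 0],
    frame_features=[[0.1, 0.9], [0.8, 0.2]],
)
```

With several keyframe masks, the feature and label lists from each keyframe can simply be concatenated before matching.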

An issue we faced when initially testing our Propagation Network was that inference on high-resolution videos was too slow to support the fast previews we were aiming for. Our initial solution was to always generate masks at a very low resolution (360p). This let us output results much faster than real-time, but it significantly degraded the user experience, since the resulting mask was low quality and unable to track smaller objects. Instead, to speed up inference, we converted the model from its original PyTorch implementation to TensorRT, a high-performance inference framework for NVIDIA GPUs. Using TensorRT on NVIDIA V100 GPUs, we saw approximately a 4.5x speedup compared to the PyTorch implementation. Combined with further optimizations, we were able to perform inference in real-time at 720p resolution, at 15.8 ms per frame (≈ 63 FPS), enabling a tight feedback loop between the user and the segmentation model without sacrificing visual fidelity.
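The relationship between per-frame latency and throughput is simple arithmetic, sketched below. The 15.8 ms figure comes from the paragraph above; the 10-second clip length and 30 FPS source frame rate in the example are assumptions chosen for illustration, and the estimate covers inference only, not decoding or encoding.

```python
def fps_from_latency(ms_per_frame):
    """Frames processed per second at a given per-frame latency."""
    return 1000.0 / ms_per_frame


def inference_seconds(video_seconds, video_fps, ms_per_frame):
    """Inference time for a clip, ignoring decode and encode overhead."""
    frames = video_seconds * video_fps
    return frames * ms_per_frame / 1000.0


throughput = fps_from_latency(15.8)          # about 63.3 frames per second
clip_time = inference_seconds(10, 30, 15.8)  # about 4.74 s for 300 frames
```

Under these assumptions, a 10-second clip needs under five seconds of inference, which is what makes a preview within seconds plausible.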

Machine learning meets streaming technology

To avoid regenerating the mask stream for the entire video on each user interaction, we devised a system for processing the video and sending the results to the client in a piecemeal manner, using an HTTP-based streaming protocol similar to HLS. When the user previews the results of the masking process on a specific segment of the video, we send a request to generate the mask output for only that segment. Decoding the input video segment, running inference on the GPU, and encoding the result quickly enough to avoid long buffering delays required us to identify and solve more than a few bottlenecks along the way. We hope to discuss our streaming inference approach in greater detail in future blog posts.
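The per-segment approach can be sketched as follows. This is an illustration under stated assumptions rather than the actual server design: the 2-second segment length, the `MaskSegmentCache` class, and the `generate` callback are all hypothetical.

```python
import math

SEGMENT_SECONDS = 2.0  # assumed segment length; the post does not state one


def segments_for_range(start, end, segment_seconds=SEGMENT_SECONDS):
    """Indices of fixed-length segments overlapping [start, end) seconds."""
    if end <= start:
        return []
    first = int(start // segment_seconds)
    last = int(math.ceil(end / segment_seconds)) - 1
    return list(range(first, last + 1))


class MaskSegmentCache:
    """Generates mask segments on demand and reuses them between requests."""

    def __init__(self, generate):
        self._generate = generate  # callable: segment index -> encoded bytes
        self._segments = {}

    def get(self, index):
        # Only run decode, inference, and encode for segments not yet built.
        if index not in self._segments:
            self._segments[index] = self._generate(index)
        return self._segments[index]

    def invalidate(self):
        """Drop cached segments, e.g. after the user adds a new click."""
        self._segments.clear()


cache = MaskSegmentCache(lambda i: f"mask-segment-{i}".encode())
preview = [cache.get(i) for i in segments_for_range(3.0, 7.5)]
```

Previewing seconds 3.0 to 7.5 touches only segments 1 through 3, so the rest of the video is never processed unless the user plays it.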

Foreground vs Background mask visualization.

Synchronizing video and mask streams on the frontend

After retrieving the mask video segment from the streaming server, the frontend provides two ways of visualizing the output to users: (1) displaying the mask as a semi-transparent overlay on the input video (ideal for identifying regions of the frame to be included or excluded), or (2) hiding the background to isolate the selected object (ideal for previewing the final result of the masking process). Since the original video stream and the mask stream are stored as two separate videos, we needed a way to composite them during playback while keeping them in sync at all times, i.e., avoiding any situation where the displayed mask is a few frames ahead of or behind the original video. To gain greater control over frame timing during playback, and to solve the synchronization issues we faced when compositing the video and mask streams, we decided to avoid the HTML5 Video API altogether. Instead, we ship a WebAssembly-compiled codec that decodes both video streams, and we upload the decoded frames as WebGL textures for compositing.
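The sketch below illustrates the two ideas in this section: deriving the frame index for both streams from a single playback clock, and blending a pixel with its mask value in each visualization mode. It is written in Python for readability; in the browser this logic would live in the decoder loop and a fragment shader. The function names, the green tint, and the 50% opacity are assumptions, not details from the actual implementation.

```python
def frame_index(clock_seconds, fps, frame_count):
    """Frame to show at a given playback time, clamped to the valid range."""
    index = int(clock_seconds * fps)
    return max(0, min(index, frame_count - 1))


def composite_pixel(color, mask, mode, tint=(0, 255, 0), opacity=0.5):
    """Blend one RGB pixel with its mask value (0.0 to 1.0).

    mode "overlay": tint masked regions semi-transparently.
    mode "isolate": fade unmasked regions to black, keeping the object.
    """
    if mode == "overlay":
        alpha = mask * opacity
        return tuple(round((1 - alpha) * c + alpha * t)
                     for c, t in zip(color, tint))
    if mode == "isolate":
        return tuple(round(mask * c) for c in color)
    raise ValueError(f"unknown mode: {mode}")


# Both streams are indexed by the same clock, so they cannot drift apart.
i = frame_index(clock_seconds=0.5, fps=30, frame_count=100)
video_pixel, mask_value = (200, 100, 50), 1.0  # stand-ins for frame i lookups
shown = composite_pixel(video_pixel, mask_value, "isolate")
```

Because one index selects the frame from both decoded streams, the mask shown is always the one generated for the video frame on screen.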

Green Screen is the first of a series of machine learning-based tools that we’ll be releasing around video creation. Are you interested in helping us build the next ones? We’re growing our engineering team across the board, so if any of the challenges described in this blog post feel exciting or relevant to you, please reach out!