Transcription of: Remove This! ✂️ AI-Based Video Completion is Amazing!
Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér. Have you ever had a moment where you took
the perfect photo, but upon closer inspection, there was this one annoying thing that ruined
the whole picture? Well, why not just use a learning algorithm
to erase those cracks in the facade of a building, or a photobombing sheep? To do that, or even to reimagine ourselves
with different eye colors, we can try one of the many research works that are capable of something that we
call image inpainting.
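For Fellow Scholars who would like a taste of this at home, here is a minimal sketch using OpenCV’s classical, non-learning inpainting routine; the file names are placeholders, and this is, of course, not the learning-based method shown in the video.

```python
import cv2

# Placeholder file names: the photo, plus a binary mask that is white
# over the region we want erased (the crack, the sheep, ...).
image = cv2.imread("photo.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Classical (non-learning) inpainting: INPAINT_TELEA is a fast marching
# method; cv2.INPAINT_NS is the Navier-Stokes based variant.
result = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("inpainted.png", result)
```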
What you see here is the legendary PatchMatch
algorithm at work, which, believe it or not, is a handcrafted technique from more than
10 years ago. Later, scientists at NVIDIA published a more
modern inpainter that uses a learning-based algorithm to do this more reliably, and for
a greater variety of images. These all work really well, but the common
denominator for these techniques is that they all work on inpainting still images. Could this also work for video? Like, removing a moving object or person from
a video? Is this possible, or is it science fiction? Let’s see if these learning-based techniques
can really do more. And now, hold on to your papers, because this
new work can really perform proper inpainting for video. Let’s give it a try by highlighting this
human. And pro tip: also highlight the shadowy region
for inpainting to make sure that not only the human but also their shadow disappears
from the footage. And, look! Wow. Let’s look at some other examples. Now that’s really something, because video
is much more difficult due to the requirement of temporal coherence, which means that it is
not nearly enough if the images are inpainted really well individually; they also have to
look good when we weave them together into a video. You will hear and see more about this in a
moment.
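To make temporal coherence a little more concrete, here is a minimal sketch of a crude flicker measure; this function and its inputs are illustrative assumptions, not the paper’s actual evaluation.

```python
import numpy as np

def flicker_score(frames, masks):
    """Crude temporal-coherence proxy: average absolute change between
    consecutive frames, measured only inside the inpainted region.
    frames: list of HxWx3 uint8 arrays; masks: list of HxW bool arrays.
    Frame-by-frame inpainting tends to score high here (it flickers)."""
    diffs = []
    for prev, cur, mask in zip(frames, frames[1:], masks[1:]):
        if mask.any():
            diffs.append(np.mean(np.abs(
                cur[mask].astype(np.float32) - prev[mask].astype(np.float32))))
    return float(np.mean(diffs)) if diffs else 0.0
```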
Not only that, but if we highlight a person,
this person not only needs to be inpainted, but we also have to track the boundaries of
this person throughout the footage and then inpaint a moving region. We get some help with that, which I will also
talk about in a moment.
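As a rough illustration of what such tracking help could look like, here is a minimal sketch that propagates a hand-drawn mask to the next frame with dense optical flow; this is a simple stand-in, not the authors’ pipeline.

```python
import cv2
import numpy as np

def propagate_mask(prev_gray, cur_gray, prev_mask):
    """Warp the previous frame's object mask onto the current frame.
    prev_gray, cur_gray: HxW uint8 grayscale frames.
    prev_mask: HxW uint8 mask (255 inside the object)."""
    # Dense flow from the current frame back to the previous one,
    # so that we can do a simple backward warp.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # For each current pixel, sample the previous mask at the position
    # the flow says it came from.
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_mask, map_x, map_y, cv2.INTER_NEAREST)
```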
Now, as you see here, these all work extremely
well, and believe it or not, you have seen nothing yet, because so far, another common
denominator in these examples was that we highlighted regions inside the video. But that’s not all. If you have been holding on to your papers
so far, now squeeze that paper, because we can also go outside, and expand our video
spatially with even more content. This one is very short, so I will keep looping
it. Are you ready? Let’s go. Wow! My goodness! The information from inside the video frames
is reused to infer what should be around the frame, and all this in a temporally
coherent manner. Now, of course, this is not the first technique
to perform this, so let’s see how it compares to the competition by erasing this bear from
the video footage. The remnants of the bear are visible with
a wide selection of previously published techniques from the last few years. This is true even for these four methods from
last year. And, let’s see how this new method did on
the same case. Yup, very good, but not perfect; we still see
some flickering. This is the example of temporal coherence, or
rather the lack thereof, that I promised earlier. But now, let’s look at this example with
the BMX rider. We see similar performance with the previous
techniques, and now, let’s have a look at the new one. Now that’s what I’m talking about! Not a trace left of this person; the only
clue that we get in reconstructing what went down here is the camera movement. It truly feels like we are living in a science
fiction world. What a time to be alive! These were the qualitative results;
now, let’s have a look at the quantitative results. In other words, we saw the videos; now let’s
see what the numbers say. We could talk all day about peak signal-to-noise
ratios, structural similarity, or other ways to measure how good these techniques
are, but you will see in a moment that it is completely unnecessary.
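For the curious, both of these metrics are easy to compute yourself; here is a minimal sketch with scikit-image on random placeholder frames, assuming scikit-image 0.19 or newer for the channel_axis argument.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder data: a ground-truth frame and an inpainted frame.
reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
inpainted = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Higher is better for both metrics.
psnr = peak_signal_noise_ratio(reference, inpainted, data_range=255)
ssim = structural_similarity(reference, inpainted,
                             channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB   SSIM: {ssim:.3f}")
```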
Why is that? Well, you see here that the second-best results
are underscored and highlighted with blue. As you see, there is plenty of competition,
as the blues are all over the place. But there is no competition at all for
first place, because this new method smokes the competition in every category. This was measured on a dataset by the name of
Densely Annotated Video Segmentation, or DAVIS in short, which contains 150 video sequences
and is annotated, which means that many of the objects are highlighted throughout
these videos, so for the cases in this dataset, we don’t have to deal with the tracking
ourselves. I am truly out of ideas as to what I should
wish for two more papers down the line. Maybe not only removing the tennis player,
but putting myself in there as a proxy? We can already grab a controller and play
as if we were real characters in real broadcast footage, so who really knows. Anything is possible. Let me know in the comments what you have
in mind for potential applications and what you would be excited to see two more papers
down the line! Thanks for watching and for your generous