Transcription of: Remove This! ✂️ AI-Based Video Completion is Amazing!
Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér. Have you ever had a moment where you took
the perfect photo, but upon closer inspection, there was this one annoying thing that ruined
the whole picture? Well, why not just use a learning algorithm
to erase those cracks in the facade of a building, or a photobombing sheep? To do that, or even to reimagine ourselves
with different eye colors, we can try one of the many research works that are capable of something that we
call image inpainting.
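For Fellow Scholars who would like a taste of this at home, here is a minimal sketch using OpenCV’s classical, non-learning inpainting routine; the file names are placeholders, and this is, of course, not the learning-based method shown in the video.

```python
import cv2

# Placeholder file names: the photo, plus a binary mask that is white
# over the region we want erased (the crack, the sheep, ...).
image = cv2.imread("photo.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Classical (non-learning) inpainting: INPAINT_TELEA is a fast marching
# method; cv2.INPAINT_NS is the Navier-Stokes based variant.
result = cv2.inpaint(image, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("inpainted.png", result)
```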
What you see here is the legendary PatchMatch
algorithm at work, which, believe it or not, is a handcrafted technique from more than
10 years ago. Later, scientists at NVIDIA published a more
modern inpainter that uses a learning-based algorithm to do this more reliably, and for
a greater variety of images. These all work really well, but the common
denominator for these techniques is that they all work on inpainting still images. Could this also work for video? Like, removing a moving object or person from
a video? Is this possible, or is it science fiction? Let’s see if these learning-based techniques
can really do more. And now, hold on to your papers, because this
new work can really perform proper inpainting for video. Let’s give it a try by highlighting this
human. And pro tip: also highlight the shadowy region
for inpainting to make sure that not only the human but also their shadow disappears
from the footage. And, look! Wow. Let’s look at some other examples. Now that’s really something, because video
is much more difficult due to the requirement of temporal coherence, which means that it is
not nearly enough if the images are inpainted really well individually; they also have to
look good when we weave them together into a video. You will hear and see more about this in a
moment.
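To make temporal coherence a little more concrete, here is a minimal sketch of a crude flicker measure; this function and its inputs are illustrative assumptions, not the paper’s actual evaluation.

```python
import numpy as np

def flicker_score(frames, masks):
    """Crude temporal-coherence proxy: average absolute change between
    consecutive frames, measured only inside the inpainted region.
    frames: list of HxWx3 uint8 arrays; masks: list of HxW bool arrays.
    Frame-by-frame inpainting tends to score high here (it flickers)."""
    diffs = []
    for prev, cur, mask in zip(frames, frames[1:], masks[1:]):
        if mask.any():
            diffs.append(np.mean(np.abs(
                cur[mask].astype(np.float32) - prev[mask].astype(np.float32))))
    return float(np.mean(diffs)) if diffs else 0.0
```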
Not only that, but if we highlight a person,
this person not only needs to be inpainted, but we also have to track the boundaries of
this person throughout the footage and then inpaint a moving region. We get some help with that, which I will also
talk about in a moment.
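As a rough illustration of what such tracking help could look like, here is a minimal sketch that propagates a hand-drawn mask to the next frame with dense optical flow; this is a simple stand-in, not the authors’ pipeline.

```python
import cv2
import numpy as np

def propagate_mask(prev_gray, cur_gray, prev_mask):
    """Warp the previous frame's object mask onto the current frame.
    prev_gray, cur_gray: HxW uint8 grayscale frames.
    prev_mask: HxW uint8 mask (255 inside the object)."""
    # Dense flow from the current frame back to the previous one,
    # so that we can do a simple backward warp.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # For each current pixel, sample the previous mask at the position
    # the flow says it came from.
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_mask, map_x, map_y, cv2.INTER_NEAREST)
```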
Now, as you see here, these all work extremely
well, and believe it or not, you have seen nothing yet, because so far, another common
denominator in these examples was that we highlighted regions inside the video. But that’s not all. If you have been holding on to your papers
so far, now squeeze that paper, because we can also go outside, and expand our video
spatially with even more content. This one is very short, so I will keep looping
it. Are you ready? Let’s go. Wow! My goodness! The information from inside the video frames
is reused to infer what should be around the frame, and all this in a temporally
coherent manner. Now, of course, this is not the first technique
to perform this, so let’s see how it compares to the competition by erasing this bear from
the video footage. The remnants of the bear are visible with
a wide selection of previously published techniques from the last few years. This is true even for these four methods from
last year. And, let’s see how this new method did on
the same case. Yup, very good, but not perfect; we still see
some flickering. This is the example of temporal coherence, or
rather the lack thereof, that I promised earlier. But now, let’s look at this example with
the BMX rider. We see similar performance with the previous
techniques, and now, let’s have a look at the new one. Now that’s what I’m talking about! Not a trace left of this person; the only
clue that we get in reconstructing what went down here is the camera movement. It truly feels like we are living in a science
fiction world. What a time to be alive! These were the qualitative results;
now, let’s have a look at the quantitative results. In other words, we saw the videos; now let’s
see what the numbers say. We could talk all day about peak signal-to-noise
ratios, structural similarity, or other ways to measure how good these techniques
are, but you will see in a moment that it is completely unnecessary.
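For the curious, both of these metrics are easy to compute yourself; here is a minimal sketch with scikit-image on random placeholder frames, assuming scikit-image 0.19 or newer for the channel_axis argument.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder data: a ground-truth frame and an inpainted frame.
reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
inpainted = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Higher is better for both metrics.
psnr = peak_signal_noise_ratio(reference, inpainted, data_range=255)
ssim = structural_similarity(reference, inpainted,
                             channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB   SSIM: {ssim:.3f}")
```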
Why is that? Well, you see here that the second-best results
are underscored and highlighted with blue. As you see, there is plenty of competition,
as the blues are all over the place. But there is no competition at all for
first place, because this new method smokes the competition in every category. This was measured on a dataset by the name of
Densely Annotated Video Segmentation, or DAVIS in short, which contains 150 video sequences
and is annotated, which means that many of the objects are highlighted throughout
these videos, so for the cases in this dataset, we don’t have to deal with the tracking
ourselves. I am truly out of ideas as to what I should
wish for two more papers down the line. Maybe not only removing the tennis player,
but putting myself in there as a proxy? We can already grab a controller and play
as if we were real characters in real broadcast footage, so who really knows. Anything is possible. Let me know in the comments what you have
in mind for potential applications and what you would be excited to see two more papers
down the line! Thanks for watching and for your generous