Deep Video Summarization
Summary
- Given an input video, we find the most informative frames and summarize the video's content in natural language.
- We first encode the frames with the CLIP image encoder and then pass the resulting embeddings through a U-Net-inspired transformer encoder-decoder with skip connections to score each frame (see the scorer sketch after this list).
- Finally, we aggregate the frame-level scores into shot-level scores and use dynamic programming (0/1 knapsack) to select which shots to keep as keyshots (see the selection sketch below).
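
Below is a minimal PyTorch sketch of the U-Net-style frame scorer. The class name, layer count, and embedding width are illustrative assumptions, not this project's exact configuration; a random tensor stands in for the CLIP image embeddings (e.g., from `CLIPModel.get_image_features` in Hugging Face `transformers`).

```python
import torch
import torch.nn as nn

class UNetTransformerScorer(nn.Module):
    """Toy U-Net-style scorer: transformer encoder layers downsample the
    frame sequence, decoder layers upsample it back, with skip connections
    between matching temporal resolutions. Sizes are illustrative only."""
    def __init__(self, dim=512, heads=8, depth=2):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))
        self.dec = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))
        self.head = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, x):              # x: (batch, n_frames, dim) CLIP features
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)            # save for the skip connection
            x = x[:, ::2]              # halve the temporal resolution
        for layer in self.dec:
            x = x.repeat_interleave(2, dim=1)      # restore resolution
            skip = skips.pop()
            x = layer(x[:, :skip.size(1)] + skip)  # skip connection
        return self.head(x).squeeze(-1)            # (batch, n_frames) scores

feats = torch.randn(1, 64, 512)   # stand-in for CLIP image embeddings
scores = UNetTransformerScorer()(feats)
print(scores.shape)               # torch.Size([1, 64])
```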
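
And a sketch of the keyshot selection step: shot scores are taken as the mean frame score within each shot, and a 0/1 knapsack DP picks the subset of shots that maximizes total score under a length budget. The `shot_bounds` format and the 15% default budget are common conventions in video summarization, assumed here rather than taken from this project.

```python
def select_keyshots(frame_scores, shot_bounds, budget_ratio=0.15):
    """Pick keyshots by 0/1 knapsack.
    frame_scores: per-frame importance scores.
    shot_bounds:  list of (start, end) frame indices, one pair per shot.
    budget_ratio: fraction of total frames the summary may use."""
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in shot_bounds]
    lengths = [e - s for s, e in shot_bounds]
    capacity = int(budget_ratio * len(frame_scores))

    # dp[c] = (best total value, chosen shot indices) using at most c frames
    dp = [(0.0, [])] * (capacity + 1)
    for i, (v, w) in enumerate(zip(values, lengths)):
        for c in range(capacity, w - 1, -1):  # backwards: each shot used once
            cand = dp[c - w][0] + v
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - w][1] + [i])
    return sorted(dp[capacity][1])

scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.3, 0.1]  # toy frame scores
shots = [(0, 2), (2, 4), (4, 6), (6, 8)]           # toy shot boundaries
print(select_keyshots(scores, shots, budget_ratio=0.5))  # -> [0, 2]
```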