Deep Video Summarization
Summary
- Given an input video, we find the most informative frames and summarize the video's content in natural language.
- We first encode the frames with the CLIP image encoder and then pass the resulting embeddings through a U-Net-inspired transformer encoder-decoder with skip connections to score each frame (see the scorer sketch after this list).
- Finally, we aggregate the frame-level scores into shot-level scores and use dynamic programming (0/1 knapsack) to select which shots to keep as keyshots (see the selection sketch below).
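
Below is a minimal PyTorch sketch of the U-Net-style frame scorer. The class name, layer count, and embedding width are illustrative assumptions, not this project's exact configuration; a random tensor stands in for the CLIP image embeddings (e.g., from `CLIPModel.get_image_features` in Hugging Face `transformers`).

```python
import torch
import torch.nn as nn

class UNetTransformerScorer(nn.Module):
    """Toy U-Net-style scorer: transformer encoder layers downsample the
    frame sequence, decoder layers upsample it back, with skip connections
    between matching temporal resolutions. Sizes are illustrative only."""
    def __init__(self, dim=512, heads=8, depth=2):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))
        self.dec = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))
        self.head = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, x):              # x: (batch, n_frames, dim) CLIP features
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)            # save for the skip connection
            x = x[:, ::2]              # halve the temporal resolution
        for layer in self.dec:
            x = x.repeat_interleave(2, dim=1)      # restore resolution
            skip = skips.pop()
            x = layer(x[:, :skip.size(1)] + skip)  # skip connection
        return self.head(x).squeeze(-1)            # (batch, n_frames) scores

feats = torch.randn(1, 64, 512)   # stand-in for CLIP image embeddings
scores = UNetTransformerScorer()(feats)
print(scores.shape)               # torch.Size([1, 64])
```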
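
And a sketch of the keyshot selection step: shot scores are taken as the mean frame score within each shot, and a 0/1 knapsack DP picks the subset of shots that maximizes total score under a length budget. The `shot_bounds` format and the 15% default budget are common conventions in video summarization, assumed here rather than taken from this project.

```python
def select_keyshots(frame_scores, shot_bounds, budget_ratio=0.15):
    """Pick keyshots by 0/1 knapsack.
    frame_scores: per-frame importance scores.
    shot_bounds:  list of (start, end) frame indices, one pair per shot.
    budget_ratio: fraction of total frames the summary may use."""
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in shot_bounds]
    lengths = [e - s for s, e in shot_bounds]
    capacity = int(budget_ratio * len(frame_scores))

    # dp[c] = (best total value, chosen shot indices) using at most c frames
    dp = [(0.0, [])] * (capacity + 1)
    for i, (v, w) in enumerate(zip(values, lengths)):
        for c in range(capacity, w - 1, -1):  # backwards: each shot used once
            cand = dp[c - w][0] + v
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - w][1] + [i])
    return sorted(dp[capacity][1])

scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.3, 0.1]  # toy frame scores
shots = [(0, 2), (2, 4), (4, 6), (6, 8)]           # toy shot boundaries
print(select_keyshots(scores, shots, budget_ratio=0.5))  # -> [0, 2]
```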