Deep Video Summarization

Summary

  • Given an input video, we find the most informative frames and summarize the video's content in natural language.
  • We first encode the frames using the CLIP model and then pass them through a U-Net-inspired transformer encoder-decoder architecture with skip connections to score each frame.
  • Finally, we aggregate the frame-level scores into shot-level scores and use dynamic programming (0/1 knapsack) to decide which shots to pick as keyshots.
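The keyshot selection step above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes shot scores are mean-pooled frame scores, knapsack weights are frame counts per shot, and the summary budget is a fixed frame count (all function names are hypothetical).

```python
def shot_scores(frame_scores, shot_boundaries):
    """Aggregate frame-level scores into one score per shot (mean pooling assumed)."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in shot_boundaries]

def select_keyshots(values, lengths, budget):
    """0/1 knapsack: maximize total shot score subject to a frame-count budget."""
    n = len(values)
    # dp[i][c] = best total score using the first i shots within capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]  # skip shot i-1
            if lengths[i - 1] <= c:  # or take it if it fits
                cand = dp[i - 1][c - lengths[i - 1]] + values[i - 1]
                if cand > dp[i][c]:
                    dp[i][c] = cand
    # Backtrack to recover which shots were picked
    picked, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            picked.append(i - 1)
            c -= lengths[i - 1]
    return sorted(picked)

# Toy example: 3 shots of 2 frames each, budget of 4 frames
frame_scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6]
shots = [(0, 2), (2, 4), (4, 6)]
values = shot_scores(frame_scores, shots)          # [0.85, 0.15, 0.65]
keyshots = select_keyshots(values, [2, 2, 2], 4)   # picks the two highest-scoring shots
```

In practice the budget is often set to a fraction (e.g. 15%) of the total video length, so the knapsack trades off shot importance against shot duration.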
Susim Mukul Roy
MS Student

My main goal is to build trustworthy AI-integrated robotic systems.