HyperTag
HyperTag copied to clipboard
Add semantic video search
First basic version: Partition video into e.g. 16 uniformly spaced (by time) sections and take a screenshot. Embed each screenshot and use average as video embedding.
Advanced: Partition video with higher granularity and extract frames e.g. every 5 seconds or fixed high number (+100). Compute embedding for every extracted frame. Compute pairwise consecutive frame distances in embedding space to infer semantically coherent video sections (similar frames). Embed each section as average of coherent frames (below a threshold). The list of average frame embeddings should be a pretty good representation of the video and comes with section start & end metadata.