minbpe
Optimizing minbpe to also support video tokenization (extract low-dimensional latent patches from video frames)
Hi Mentor Karpathy,
I was wondering whether minbpe could be extended to tokenize video frames into embedded patches, as proposed in Sora's technical report and the ViT paper: extract fixed-size latent image patches from the video frames, linearly embed the patches, add position embeddings, and save the resulting sequence of vectors. That sequence could later be fed into a decoder-only transformer, such as a diffusion version of your nanoGPT, to generate new videos.
It might not generate well but it might be a fun exercise for the sake of learning.
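To make the idea concrete, here is a rough sketch of the patch pipeline described above, assuming NumPy. Nothing here is part of minbpe: `patchify` and `embed_patches` are hypothetical names, the projection and position embeddings are random (untrained) stand-ins for learned parameters, and the patches are cut from raw pixels rather than from the latents of a learned encoder.

```python
import numpy as np

def patchify(frames, patch):
    """Split (T, H, W, C) frames into flattened, non-overlapping patches.

    Returns an array of shape (T * H/patch * W/patch, patch * patch * C).
    """
    T, H, W, C = frames.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch size"
    gh, gw = H // patch, W // patch
    # Split the H and W axes into (grid index, offset inside the patch).
    x = frames.reshape(T, gh, patch, gw, patch, C)
    # Bring the two grid axes together: (T, gh, gw, patch, patch, C).
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(T * gh * gw, patch * patch * C)

def embed_patches(frames, patch=4, dim=32, seed=0):
    """Linearly embed patches and add a position embedding.

    Assumption: random weights stand in for parameters that would be learned.
    """
    rng = np.random.default_rng(seed)
    patches = patchify(frames, patch)            # (n, d)
    n, d = patches.shape
    w_embed = rng.normal(0.0, 0.02, size=(d, dim))  # hypothetical linear projection
    pos = rng.normal(0.0, 0.02, size=(n, dim))      # hypothetical position embedding
    return patches @ w_embed + pos               # (n, dim)

# Toy "video": 2 frames of 8x8 RGB pixels.
frames = np.random.default_rng(1).random((2, 8, 8, 3))
tokens = embed_patches(frames)
print(tokens.shape)  # (8, 32)
```

Each 8x8 frame yields four 4x4 patches, so the two frames produce a sequence of 8 vectors. Note that these are continuous vectors, not the discrete token ids that minbpe produces for text.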