minbpe Optimizing minbpe to also support video tokenization (extract low-dimensional latent patches from video frames)

Open • Jaykef opened this issue 4 months ago • 1 comment

Hi Mentor Karpathy,

I was wondering whether minbpe could be extended to tokenize video frames into embedded patches, as proposed in the Sora technical report and the ViT paper: extract fixed-size latent image patches from the video frames, linearly embed the patches, add position embeddings, and save the resulting sequence of vectors. That sequence could later be fed into a decoder-only transformer, such as a diffusion version of your nanoGPT, to generate new videos.
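To make the idea concrete, here is a rough, dependency-free sketch of the three steps above (patch extraction, linear embedding, position embedding). Everything in it is an assumption for illustration: the function names `patchify`, `random_matrix` and `embed` are hypothetical and not part of minbpe, the frames are tiny grayscale grids stored as nested lists, and the weights are random rather than learned. It also patches raw pixels directly, ViT-style, and skips the compression into a latent space that the Sora report describes.

```python
import random


def patchify(frames, patch):
    """Split each H x W frame into non-overlapping patch x patch blocks.

    Each block is flattened row-major into a list of patch * patch values.
    Patches are emitted frame by frame, top to bottom, left to right.
    """
    patches = []
    for frame in frames:
        height, width = len(frame), len(frame[0])
        # Assumption: frame dimensions are exact multiples of the patch size.
        assert height % patch == 0 and width % patch == 0
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                patches.append([frame[top + r][left + c]
                                for r in range(patch)
                                for c in range(patch)])
    return patches


def random_matrix(rows, cols, rng, scale=0.02):
    """Stand-in for learned parameters: small Gaussian noise."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def embed(patches, weight, pos):
    """Project each flat patch with `weight`, then add its position embedding."""
    dim = len(weight[0])
    tokens = []
    for i, p in enumerate(patches):
        tokens.append([sum(p[k] * weight[k][d] for k in range(len(p))) + pos[i][d]
                       for d in range(dim)])
    return tokens


rng = random.Random(0)
T, H, W, P, D = 2, 4, 4, 2, 8  # frames, height, width, patch size, embed dim

# Toy "video": T grayscale frames of H x W random pixel values.
video = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]

patches = patchify(video, P)               # T * (H/P) * (W/P) flat patches
weight = random_matrix(P * P, D, rng)      # linear embedding (untrained)
pos = random_matrix(len(patches), D, rng)  # one position vector per patch
tokens = embed(patches, weight, pos)

print(len(tokens), len(tokens[0]))  # 8 8
```

Unlike BPE, the output here is a sequence of continuous vectors rather than discrete token ids, so it would feed a transformer's input layer directly instead of going through a vocabulary lookup.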

It might not generate well, but it could be a fun learning exercise.

Feb 26 '24 12:02 Jaykef
