OFA
Can OFA support video-language tasks such as video captioning?
Suppose we can extract several frames from a video. Any suggestions on how to approach this?
Not done yet, but possible. We still need to figure out whether we need to change the pretraining or can simply adapt the pretrained models to this task. The simplest way might be to treat the average of the frames as a single image.
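
For anyone who wants to try the frame-averaging idea, here is a rough sketch, not something from the OFA codebase. It assumes the frames are already decoded into same-sized RGB arrays (for example with any video reader you like), and the helper names `sample_frame_indices` and `average_frames` are made up for illustration.

```python
import numpy as np


def sample_frame_indices(num_frames, num_samples):
    """Return evenly spaced frame indices covering the whole clip (hypothetical helper)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int).tolist()


def average_frames(frames):
    """Average same-sized H x W x 3 uint8 frames into one uint8 image (hypothetical helper)."""
    if len(frames) == 0:
        raise ValueError("need at least one frame")
    # Accumulate in float32 so the sum does not overflow uint8.
    stack = np.stack([np.asarray(f, dtype=np.float32) for f in frames])
    mean = stack.mean(axis=0)
    # Round and clip back to the valid pixel range.
    return np.clip(np.rint(mean), 0, 255).astype(np.uint8)


# Toy usage: a fake 10-frame clip of constant-colour 4x4 frames.
clip = [np.full((4, 4, 3), 10 * i, dtype=np.uint8) for i in range(10)]
picked = sample_frame_indices(len(clip), 4)
image = average_frames([clip[i] for i in picked])
```

The resulting `image` can then be fed to whatever image-captioning pipeline you already use, exactly as if it were a single photo. Keep in mind that averaging discards temporal order and blurs motion, so this is only a baseline.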