VILA
Long context video module only
Great work and research.
My question is simply whether it is possible to use only the visual/video part (already pretrained on a video dataset such as Kinetics) and fine-tune it on a long-video dataset, e.g. to classify 1- or 2-minute video clips.