ammesatyajit
So for the dataset, I used the HowTo100M dataset and filtered it down to just the cooking videos. The ids for the cooking-only subset are listed in VideoBERT/data/ids.txt. Here are the steps...
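A minimal sketch of that filtering step, assuming ids.txt holds one video id per line (the helper name and toy ids here are illustrative, not from the repo):

```python
# Filter a list of HowTo100M video ids down to the cooking subset
# listed in VideoBERT/data/ids.txt (one id per line is assumed).

def filter_cooking_ids(all_ids, ids_txt_lines):
    cooking = {line.strip() for line in ids_txt_lines if line.strip()}
    return [vid for vid in all_ids if vid in cooking]

# Toy stand-ins for the real dataset manifest and ids.txt contents.
all_ids = ["abc123", "def456", "ghi789"]
ids_txt = ["abc123\n", "ghi789\n"]
print(filter_cooking_ids(all_ids, ids_txt))  # ['abc123', 'ghi789']
```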
It was about 400-500 GB
Hi, so what I used was raw_caption_superclean.json for the captions file, which I believe you can download as part of the raw caption zip file. I did run the other repo...
@FormerAutumn Sure! I'm happy to answer any questions you have.
So for the text next-token prediction, there is no video involved; I am just using the model for next-word prediction in a sentence (similar to GPT). This...
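To illustrate what text-only next-word prediction means here, a toy bigram sketch (this is not the repo's code, just the idea of predicting the most likely next token from the previous one):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-token frequencies for each token in a token corpus."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent next token, or None if unseen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

# Toy cooking-caption corpus standing in for the real training data.
corpus = [["mix", "the", "flour"], ["mix", "the", "eggs"], ["whisk", "the", "eggs"]]
counts = train_bigram(corpus)
print(predict_next(counts, "the"))  # 'eggs' ("eggs" follows "the" twice, "flour" once)
```

A real model replaces the counts with a learned distribution over the vocabulary, but the prediction step is the same: pick the highest-probability next token.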
> @ammesatyajit Thanks for your kindness!
>
> Do you know where to get the 'data/newest-data-max-len-20.npy' in https://github.com/MDSKUL/MasterProject/blob/master/stap5/globals.py ?
> (I scan all the urls the author mentioned and...
So video_next_tok_pred takes in the tokens from the validation set. It doesn't take in video clips. Hope that answers your question.
Hi, sorry if the readme was slightly confusing. The 20736 centroids were stored in separate files because of the hierarchical k-means. The only purpose of concatenating them was so I...
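A minimal sketch of that concatenation step, assuming each file holds the centroid vectors for one branch of the hierarchy (the function name and toy vectors are illustrative; the repo stores these as .npy arrays):

```python
# Stack per-file centroid chunks into one flat list so every centroid
# gets a single global index into the visual-token vocabulary.

def concat_centroids(centroid_chunks):
    merged = []
    for chunk in centroid_chunks:  # each chunk: list of centroid vectors
        merged.extend(chunk)
    return merged

chunk_a = [[0.0, 1.0], [2.0, 3.0]]  # centroids from one k-means leaf
chunk_b = [[4.0, 5.0]]              # centroids from another leaf
merged = concat_centroids([chunk_a, chunk_b])
print(len(merged))  # 3
```

With numpy this is just `np.concatenate` over the loaded arrays along axis 0; the point is only that the separate files become one centroid table.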
@joaanna Sorry for not replying earlier. I am not going to be able to provide a detailed response because I am a little busy at the moment due to personal...
@FormerAutumn no problem. Vision transformer is really interesting, hope you find what you are looking for :)