UniVL
An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
Please tell me where the code for multimodal sentiment analysis is. Thank you!
I want to input only text features or only video features to UniVL. The paper says that one transformer combines the text representation **T** and the video representation **V**. Could you...
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the...
In dataloaders/README.md: ``` This file is generated from `youcookii_annotations_trainval.json`, which can be downloaded from [official webpage](http://youcook2.eecs.umich.edu/download). ``` But I downloaded **youcookii_annotations_trainval.tar.gz** from ![image](https://user-images.githubusercontent.com/15980746/177303125-4a6c69d5-2d89-4db0-bf9d-050d81b1a17d.png), extracted youcookii_annotations_trainval.json, and then found that **youcookii_annotations_trainval.json has...
I followed the steps to download all the necessary dependencies and data to run the code. When running the code, this error is thrown: `in main raise subprocess.CalledProcessError(returncode=process.returncode, subprocess.CalledProcessError: Command...
Hi! From your paper and the readme.md file at (https://github.com/microsoft/UniVL)/dataloaders/, I infer that the CSV file you used differs from the original CSV file. It is mentioned that 1.2M videos...
Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence...
Convert the input list of arrays to a NumPy array and negate it for further computation; the code throws an error otherwise.
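A minimal sketch of the kind of fix described above, not the actual patch: the unary minus is not defined for a plain Python list, so the list of arrays is first converted with `np.asarray` and then negated. The names `sim_rows` and `negate_scores` and the sample values are hypothetical, and the final ranking step is only an assumed example of the "further computation".

```python
import numpy as np

# Hypothetical stand-in for the input: a list of equal-length 1-D arrays.
sim_rows = [np.array([0.9, 0.1, 0.3]), np.array([0.2, 0.8, 0.5])]


def negate_scores(rows):
    """Stack a list of equal-length arrays into one 2-D array and negate it."""
    return -np.asarray(rows)


# Negating the list directly fails, because lists do not support unary minus.
try:
    -sim_rows
    list_negation_failed = False
except TypeError:
    list_negation_failed = True

neg = negate_scores(sim_rows)

# Assumed follow-up step: sorting the negated scores in ascending order
# ranks the original scores in descending order.
ranked = np.argsort(neg, axis=1)
```

Converting once at the entry point keeps the rest of the computation on a single 2-D array, which is what axis-wise calls such as `np.argsort(..., axis=1)` expect.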
Hello, I am trying to run your code but I keep running into issues with the distributed learning. Is it possible to run without this?
Hi, impressive work! How can I extract features from my own video-text datasets to fine-tune the model?