Zineng Tang
@amyeroberts Hey, thanks for your suggestions! But we cannot move the masks to TvltForPreTraining, since the masking has to be done in TvltModel. If we move the masks from the embedding to TvltModel, then...
OK I found that all my past reviews are 'pending' and maybe they were never sent out lol, which is my bad. Anyway, I addressed the comments and let me...
@NielsRogge what do you think about the current state? Is there anything else left to address? Thanks!
@NielsRogge I have now addressed the remaining comments. It makes sense to me that TvltForQuestionAnswering is not needed, since it is the same as TvltForAudioVisualClassification.
@amyeroberts Sounds great. Btw, there seems to be a failure in another model's tests: FAILED tests/models/hubert/test_modeling_tf_hubert.py::TFHubertRobustModelTest::test_dataset_conversion. Do you think it comes from this branch or from the main branch?
@amyeroberts Now it passed the tests! Thanks so much for the help/suggestions all the way. :)
Refer to this PR https://github.com/microsoft/i-Code/pull/36 for more details and a demo before it is merged.
UDOP is a generative, open-vocabulary model, so the answer should be yes. We have a text decoder to generate text, so there is no reason UDOP can't generate text.
RVL-CDIP is a part of IIT-CDIP. People use many kinds of OCR engines, such as Microsoft, Tesseract, etc. You can find IIT-CDIP here and use only its RVL-CDIP portion: https://data.nist.gov/od/id/mds2-2531 https://zenodo.org/record/6540454#.Y7ceCuzMI0Q
Some rough details are provided in the paper. You will need at least 16 V100s, or enough GPUs with more than 30GB of memory each, to train for around 1 week for some epochs.