Zineng Tang comments

Results: 39 comments by Zineng Tang

Add TVLT

@amyeroberts Hey, thanks for your suggestions! We cannot move the masking to TvltForPreTraining, though, since it has to be done in TvltModel. If we move the masks from the embedding to TvltModel, then...

Add TVLT

OK I found that all my past reviews are 'pending' and maybe they were never sent out lol, which is my bad. Anyway, I addressed the comments and let me...

Add TVLT

@NielsRogge what do you think about the current state? Is there anything else left to address? Thanks!

Add TVLT

@NielsRogge I have now addressed the remaining comments. It makes sense to me that TvltForQuestionAnswering is not needed, since it is the same as TvltForAudioVisualClassification.

@amyeroberts Sounds great. Btw, there seems to be a failure in another model's tests: FAILED tests/models/hubert/test_modeling_tf_hubert.py::TFHubertRobustModelTest::test_dataset_conversion. Do you think it comes from this branch or from the main branch?

Add TVLT

@amyeroberts Now it passed the tests! Thanks so much for the help/suggestions all the way. :)

Demos on actual documents

Refer to this PR for more details and a demo until it is merged: https://github.com/microsoft/i-Code/pull/36

Can the provided models perform multi-label classification?

UDOP is a generative, open-vocabulary model, so the answer should be yes. It has a text decoder that generates text, so there is no reason UDOP can't generate the labels as text.
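As a rough illustration of how generated text could be turned into multiple labels, here is a minimal sketch. It assumes the model is prompted to emit a comma-separated list of labels; the `parse_labels` helper, the separator, and the label names are hypothetical and not part of UDOP.

```python
def parse_labels(generated_text, known_labels, sep=","):
    """Split decoder output into labels, keeping only those in a known label set.

    Hypothetical helper: assumes the model was prompted to emit
    labels separated by `sep`.
    """
    # Normalize each candidate so matching is case- and whitespace-insensitive.
    candidates = [part.strip().lower() for part in generated_text.split(sep)]
    known = {label.lower() for label in known_labels}
    # Drop anything the model produced that is not a known label.
    return sorted({c for c in candidates if c in known})


# Example with made-up decoder output and a made-up label set.
generated = "invoice, letter, unicorn"
print(parse_labels(generated, ["Invoice", "Letter", "Memo"]))
```

Filtering against a known label set is one design choice among several; since the model is open vocabulary, you could also keep unknown strings and handle them separately.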

where is the rvl-cdip dataset

RVL-CDIP is a part of IIT-CDIP. People use many kinds of OCR engines, such as Microsoft's, Tesseract, etc. You can find IIT-CDIP here and use only its RVL-CDIP portion: https://data.nist.gov/od/id/mds2-2531 https://zenodo.org/record/6540454#.Y7ceCuzMI0Q

Computation resources and training time

Some rough details are provided in the paper. You will need at least 16 V100s, or enough GPUs with more than 30 GB of memory each, to train for around 1 week for some epochs.
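To put that estimate in GPU-hours, here is a back-of-the-envelope sketch. It assumes exactly 16 GPUs running continuously for 7 days, which is only a reading of the rough figures above and not a number from the paper; `gpu_hours` is a hypothetical helper.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for num_gpus running continuously for the given days."""
    return num_gpus * days * 24


# Assumed inputs: 16 GPUs for about one week.
print(gpu_hours(16, 7))  # 16 * 7 * 24 = 2688
```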