Ryota Tanaka
Hi, I solved this problem by replacing [this line](https://github.com/clovaai/donut/blob/1.0.7/donut/util.py#L64) so that only metadata.jsonl is read up front and the images are loaded during training in `__getitem__`. Thanks for your advice.
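For anyone hitting the same issue, here is a minimal sketch of the idea, not the actual patch to `donut/util.py`. It assumes each line of metadata.jsonl is a JSON object with `file_name` and `ground_truth` keys; the class name `LazyImageDataset` and the `loader` argument are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


class LazyImageDataset:
    """Index metadata.jsonl up front; read each image only when it is requested."""

    def __init__(self, root, loader=None):
        self.root = Path(root)
        # Hypothetical hook: the default returns raw bytes. For real training,
        # pass an image decoder instead (for example one built on PIL).
        self.loader = loader or (lambda path: path.read_bytes())
        # Only the small text records are held in memory here, not the images.
        with open(self.root / "metadata.jsonl", encoding="utf-8") as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # The image is read from disk here, once per access.
        image = self.loader(self.root / record["file_name"])
        return {"image": image, "ground_truth": record.get("ground_truth")}
```

Because the class only implements `__len__` and `__getitem__`, it should be usable as a map-style dataset, so memory use at startup depends on the size of metadata.jsonl rather than on the images.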
@kebijuelun I think the provided code does not support feeding the question into the Q-Former. See https://github.com/salesforce/LAVIS/issues/198
@kebijuelun In Table 4 of the original paper, the Flan-T5xl model requires 1.2B params when fine-tuning on the VQAv2 task (but 1.1B params on the image captioning task). I suspect that Flan-T5xl...
Hi, thank you for your interest in our work :) We currently plan to release the code for M3D within a month.