Question about BEiT-3 pre-training
Thank you very much for your work. I have recently been studying the BEiT-3 model and have a few questions I would appreciate answers to:
- Is the training data shuffled during pre-training? Does it mix images, text, and image-text pairs?
- If the data are mixed, how are the different experts selected during training? An MoE structure typically includes a gating network, but this model does not seem to have one (my current understanding is in the sketch at the end of this post).
- When fine-tuning the model for different downstream tasks, do users need to select experts manually?

I look forward to your response.
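For context, here is a minimal sketch of how I currently understand the expert selection: a Multiway-style feed-forward block where each token is routed to a modality-specific expert by whether it is an image or a text token, so no learned gating network is needed. All class and argument names below are my own and may not match the actual implementation in this repo.

```python
import torch
import torch.nn as nn


class MultiwayFFN(nn.Module):
    """Sketch of modality-routed feed-forward experts (my assumption, not the repo's code)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # One feed-forward expert per modality; the shared self-attention
        # that would precede this block is omitted here.
        self.vision_expert = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )
        self.language_expert = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); is_vision: (batch, seq_len) boolean mask
        # marking image tokens. For an image-text pair, image tokens pass
        # through the vision expert and text tokens through the language
        # expert, so the "routing" is fixed by modality rather than learned.
        out = torch.empty_like(x)
        out[is_vision] = self.vision_expert(x[is_vision])
        out[~is_vision] = self.language_expert(x[~is_vision])
        return out
```

If this is roughly correct, my second question is essentially whether the same modality-based routing is kept unchanged during fine-tuning, or whether a particular expert has to be chosen for each downstream task.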