Dezhao Song
Thanks a lot for the quick response! I guess I had one misunderstanding before, but just to clarify: is the hypothesis the concatenation of the question and answer, or is it answer...
Got it, and thanks again. I will use the first choice. I agree it makes more sense and also requires fewer changes to the data processing code, i.e., I can...
Thanks a lot
Thanks again for your help and I am now able to train the model with MAN (i.e., BertForMultipleChoice_SAN). Just one quick question. When training, in the log, I see this...
I see. Thanks.
Hello @delock, I was wondering whether you could also take a look at this one? Thanks.
Hello @delock, thanks for the pull request. I tested this and yes, it worked for the dense models (e.g., qwen3-8b and qwen3-32b). However, it still failed on the MoE...
Hello @loadams, could we re-open this? When I was testing the Qwen3-MoE models, I still got the same error. The dense models work fine.
@ranzhejiang, thanks for the fix. I tested this commit and it worked when I used "sdpa" attention. However, when I changed the attention to "flash_attention_2", I still got an...
@loadams: please see my additional test above.