trl
DataCollatorForCompletionOnlyLM does not work with FSDP
fsdp_qlora.txt
The loss becomes NaN when I use DataCollatorForCompletionOnlyLM in the FSDP pipeline (attached for reference), and I get the following warning:
"could not find instruction key [882] in the following instance: <|start_header_id|>user<|end_header_id|> "
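For context, a completion-only collator copies the token IDs into the labels and sets every position up to the end of the response template to -100, the value PyTorch's cross-entropy ignores. When the template cannot be found, TRL's collator ignores the whole example in the loss, which is what the warning above reports. The pure-Python sketch below is a simplified illustration, not TRL's code: it handles only a response template, and the token IDs and helper names are made up. It shows how a fully masked sequence turns into NaN, since a mean over zero unmasked tokens is 0/0 (PyTorch's mean-reduced cross-entropy also returns NaN when every target is ignored).

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy (ignore_index)


def find_subsequence(seq, sub):
    """Return the index of the first occurrence of sub in seq, or -1 if absent."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt(input_ids, response_template):
    """Build labels that keep only the tokens after the response template.

    If the template is not found, every position is masked, mirroring a
    collator that drops the example from the loss.
    """
    start = find_subsequence(input_ids, response_template)
    if start == -1:
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(response_template)
    return [IGNORE_INDEX] * end + list(input_ids[end:])


def mean_loss(per_token_losses, labels):
    """Average per-token losses over unmasked positions; 0/0 gives NaN."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    if not kept:
        return math.nan  # matches PyTorch's result when every target is ignored
    return sum(kept) / len(kept)


RESPONSE = [7, 8]              # hypothetical token IDs of the response template
found = [1, 2, 3, 7, 8, 4, 5]  # prompt, template, completion
missing = [1, 2, 3, 9, 9]      # template absent (e.g. truncated or split off)

print(mask_prompt(found, RESPONSE))                          # [-100, -100, -100, -100, -100, 4, 5]
print(mean_loss([1.0] * 7, mask_prompt(found, RESPONSE)))    # 1.0
print(mean_loss([1.0] * 5, mask_prompt(missing, RESPONSE)))  # nan
```

In a real batch the loss only becomes NaN when every sequence in the micro-batch is fully masked, so small per-device batch sizes make this much more likely.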
When I test the collator on its own in a separate Jupyter notebook, I don't see this warning.
Is this related to the distributed training setup?
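On the distributed side, one relevant detail: each FSDP rank computes the loss on its own micro-batch, so a single fully masked micro-batch makes that rank's loss NaN. Assuming the reported value is an average over ranks (the Hugging Face Trainer gathers the loss across processes before logging), one NaN rank is enough to make the logged loss NaN. The sketch below uses made-up numbers and hypothetical helper names to show that propagation; it does not model FSDP itself.

```python
import math


def micro_batch_loss(per_token_losses, keep_mask):
    """Mean loss over kept tokens on one rank; NaN if nothing is kept."""
    kept = [loss for loss, keep in zip(per_token_losses, keep_mask) if keep]
    return sum(kept) / len(kept) if kept else math.nan


def logged_loss(rank_losses):
    """Plain average of per-rank losses, as a logger gathering across ranks might do."""
    return sum(rank_losses) / len(rank_losses)


# Hypothetical 2-GPU run with one example per device.
rank0 = micro_batch_loss([1.0, 2.0], [True, True])    # template found -> 1.5
rank1 = micro_batch_loss([1.0, 2.0], [False, False])  # fully masked -> nan
print(rank0, rank1, logged_loss([rank0, rank1]))      # 1.5 nan nan
```

Beyond logging, NaN gradients on one rank would also reach the other ranks when gradients are reduced, so the distributed setup can spread the problem even if it does not cause the missing template.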
I followed the FSDP approach described here:
https://www.philschmid.de/fsdp-qlora-llama3
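One thing worth checking in the attached script is whether SFTTrainer is called with packing=True (the linked guide's training script may enable it; please verify against the attached config). With packing, tokenized examples are concatenated and cut into fixed-length blocks, so some blocks start in the middle of a conversation and never contain the template, even though every original example does. That would also fit with the collator looking fine on individual examples in a notebook. As far as I know, the TRL documentation describes DataCollatorForCompletionOnlyLM as working only with packing disabled. The sketch below is a simplified stand-in for packing with hypothetical helper names and token IDs, not TRL's implementation.

```python
EOS = 0  # hypothetical end-of-sequence token ID


def contains(seq, sub):
    """True if sub occurs as a contiguous run inside seq."""
    return any(seq[i:i + len(sub)] == sub for i in range(len(seq) - len(sub) + 1))


def pack(examples, block_size):
    """Concatenate EOS-terminated examples and cut the stream into equal blocks.

    Simplified stand-in for sequence packing: the incomplete tail is dropped
    and block boundaries ignore where one conversation ends.
    """
    stream = []
    for ids in examples:
        stream.extend(ids + [EOS])
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


RESPONSE = [7, 8]  # hypothetical response-template token IDs
examples = [
    [1, 2, 3, 7, 8, 4, 5],
    [1, 2, 7, 8, 6],
    [1, 1, 1, 1, 1, 1, 7, 8, 4],
]

print([contains(ex, RESPONSE) for ex in examples])         # [True, True, True]
print([contains(b, RESPONSE) for b in pack(examples, 6)])  # [True, True, False, True]
```

The third block has no template, so a completion-only collator would mask it entirely; a template can also be cut in half at a block boundary. If packing turns out to be enabled, running with packing=False and a sequence length long enough to keep the assistant turn is one thing to try.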
Can anyone suggest what I am missing?