Results: 1 issue of tkdcjf159
When training a language model (LM) with DeepSpeed's Sequence Parallel (Ulysses), it's typical to get a cross-entropy loss for each rank. To compute the gradients accurately, as [I understand it,...
Labels: bug, training