XLzed
> Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions: > > 1. Update the...
The Megatron GPTDataset implements batch packing, similar to https://huggingface.co/docs/trl/sft_trainer#packing-dataset--constantlengthdataset-, which yields samples of constant length. However, I don't know how packing affects performance compared to padding, or maybe reset-attention-mask...
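To make the packing-versus-padding comparison concrete, here is a minimal pure-Python sketch. It is not the Megatron GPTDataset or the TRL ConstantLengthDataset code; the function names, the `eos_id` separator, and the `pad_id` value are all assumptions chosen for illustration. It only shows how many positions in a batch end up as padding under each scheme, not the actual throughput effect.

```python
from typing import List


def pack_constant_length(docs: List[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate documents (separated by eos_id) and cut fixed-length chunks.

    Hypothetical helper: the trailing partial chunk is dropped for simplicity.
    """
    buffer: List[int] = []
    chunks: List[List[int]] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            chunks.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return chunks


def pad_to_length(docs: List[List[int]], seq_len: int, pad_id: int) -> List[List[int]]:
    """Truncate or right-pad every document to exactly seq_len tokens."""
    return [doc[:seq_len] + [pad_id] * max(0, seq_len - len(doc)) for doc in docs]


def padding_fraction(batch: List[List[int]], pad_id: int) -> float:
    """Share of positions in the batch that hold the pad token."""
    total = sum(len(row) for row in batch)
    return sum(row.count(pad_id) for row in batch) / total if total else 0.0


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
packed = pack_constant_length(docs, seq_len=4, eos_id=0)
padded = pad_to_length(docs, seq_len=4, pad_id=-1)
print(packed)
print(padding_fraction(padded, pad_id=-1))
```

In this toy case the packed batch contains no pad tokens, while a quarter of the padded batch is padding. Note that packed samples can contain tokens from several documents, which is where an option such as resetting the attention mask at document boundaries comes in.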
> Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)? The only error I recognize is "Stream...
If I force sendTaggedMessage to be blocking, the example works fine. ``` // final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, blocking); final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress()...
It seems that tag-matching operations are not strictly completed in order. Maybe we have to handle out-of-order completion, or use a different UCX semantic? I don't know if the...
> I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the maximum size of a data frame in HTTP2, the problem disappeared. There are...
> Thanks a lot for your comments. We will supplement these two functions soon. Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I add...
> > Thanks a lot for your comments. We will supplement these two functions soon. > > Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p...
> I think I may have figured this out. > > Try setting `make_vocab_size_divisible_by` to `256`, which will reduce the chance of saving a corrupted checkpoint compared to the default...
> maybe this could help? i'm not sure🤔 https://github.com/alibaba/Pai-Megatron-Patch/tree/main/megatron_patch/fixes/optimizer_offloading Thanks, but I am not using optimizer offloading. Does this also affect native usage? I'll give it a try in my...