XLzed

Results 11 comments of XLzed

> Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions:
>
> 1. Update the...

The Megatron GPTDataset is implemented with batch packing, like https://huggingface.co/docs/trl/sft_trainer#packing-dataset--constantlengthdataset- which provides samples of constant length. However, I don't know the performance effect of packing compared to padding, or maybe reset-attention-mask...
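
To make the packing versus padding distinction concrete, here is a minimal sketch in plain Python. It is not the Megatron or TRL implementation; the helper names (`pad_samples`, `pack_samples`, `pad_fraction`) and the token ids are hypothetical and chosen only for illustration.

```python
from typing import List

def pad_samples(samples: List[List[int]], seq_len: int, pad_id: int = 0) -> List[List[int]]:
    """Padding: one row per sample, truncated or right-padded to seq_len."""
    return [s[:seq_len] + [pad_id] * max(0, seq_len - len(s)) for s in samples]

def pack_samples(samples: List[List[int]], seq_len: int, eos_id: int = 1) -> List[List[int]]:
    """Packing: concatenate samples (separated by eos_id) into one token
    stream, then cut the stream into rows of exactly seq_len tokens.
    The incomplete trailing row is dropped in this sketch."""
    stream: List[int] = []
    for s in samples:
        stream.extend(s)
        stream.append(eos_id)
    n_rows = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]

def pad_fraction(rows: List[List[int]], pad_id: int = 0) -> float:
    """Fraction of token slots occupied by padding."""
    total = sum(len(r) for r in rows)
    pads = sum(tok == pad_id for r in rows for tok in r)
    return pads / total if total else 0.0

samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14], [15]]
padded = pad_samples(samples, seq_len=4)
packed = pack_samples(samples, seq_len=4)
print(padded, pad_fraction(padded))
print(packed, pad_fraction(packed))
```

In the packed rows, tokens from different samples share a row (for example `[8, 9, 1, 10]`), which is the situation where resetting the attention mask at sample boundaries becomes relevant.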

> Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)? The only error I recognize is "Stream...

If I force `sendTaggedMessage` to be blocking, the example works fine.
```
// final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, blocking);
final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress()...
```

It seems that tag-matching operations are not strictly completed in order. Maybe we have to handle out-of-order completion, or use a different UCX semantic? I don't know if the...
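
One generic way to deal with out-of-order completion is a reorder buffer on the receiving side. The sketch below is plain Python and does not use any UCX API; it assumes each message carries a sequence number (hypothetical here, for example encoded in the tag or a small header), and the `ReorderBuffer` class is invented for illustration.

```python
from typing import Dict, List

class ReorderBuffer:
    """Releases payloads in sequence order even if they complete out of order."""

    def __init__(self) -> None:
        self._next_seq = 0                    # next sequence number to deliver
        self._pending: Dict[int, bytes] = {}  # completed but not yet deliverable

    def on_completion(self, seq: int, payload: bytes) -> List[bytes]:
        """Record a completed message and return every payload that is now
        deliverable in order (possibly an empty list)."""
        self._pending[seq] = payload
        ready: List[bytes] = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready

buf = ReorderBuffer()
delivered: List[bytes] = []
# Completions arrive in the order 1, 0, 3, 2.
for seq, payload in [(1, b"B"), (0, b"A"), (3, b"D"), (2, b"C")]:
    delivered.extend(buf.on_completion(seq, payload))
print(delivered)
```

The trade-off is extra buffering on the receiver in exchange for keeping sends non-blocking.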

> I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the maximum size of a data frame in HTTP/2, the problem disappeared. There are...

> Thanks a lot for your comments. We will supplement these two functions soon.

Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I add...

> > Thanks a lot for your comments. We will supplement these two functions soon.
>
> Does the ndtimeline support async nccl stream like grad_reduce/params_gather overlap and interleaved p2p...

> I think I may have figured this out.
>
> Try setting `make_vocab_size_divisible_by` to `256`, which will reduce the chance of saving a corrupted checkpoint compared to the default...
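
For context on what `make_vocab_size_divisible_by` changes, here is a small sketch of the vocabulary padding arithmetic. It assumes the padded vocabulary size is rounded up to a multiple of `make_vocab_size_divisible_by` times the tensor-parallel size; the function name `padded_vocab_size` is hypothetical, and the sketch says nothing about why a checkpoint might become corrupted.

```python
def padded_vocab_size(orig_vocab_size: int, divisible_by: int, tp_size: int = 1) -> int:
    """Round orig_vocab_size up to the next multiple of divisible_by * tp_size.

    Assumption: this mirrors how the vocabulary is padded so that the
    embedding table splits evenly across tensor-parallel ranks.
    """
    multiple = divisible_by * tp_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

# Example with a 50257-token vocabulary (the GPT-2 tokenizer size).
print(padded_vocab_size(50257, 128))             # 50304
print(padded_vocab_size(50257, 256))             # 50432
print(padded_vocab_size(50257, 128, tp_size=8))  # 51200
```

A larger divisor only adds unused padding rows to the embedding table; the real tokens keep their ids.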

> maybe this could help? i'm not sure🤔 https://github.com/alibaba/Pai-Megatron-Patch/tree/main/megatron_patch/fixes/optimizer_offloading

Thanks, but I did not use optimizer offloading. Will this also affect the native usage? I'll have a try in my...