XLzed

Results: 11 comments of XLzed

[QUESTION] Training Mixtral 8x7B on 16 x H100 only achieves low throughput of 130 TFLOPS

> Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions: > > 1. Update the...

[QUESTION] Should llama or gpt-like models have padding attention mask?

The Megatron GPTDataset is implemented with batch packing, similar to the TRL packing dataset (https://huggingface.co/docs/trl/sft_trainer#packing-dataset--constantlengthdataset-), which provides samples of constant length, but I don't know the performance effect of packing compared to padding, or maybe reset-attention-mask...
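To make the packing versus reset-attention-mask distinction concrete, here is a minimal sketch in plain Python. It is not Megatron's GPTDataset or TRL's implementation; `pack_documents`, `block_causal_mask`, and the `pad_id` value are hypothetical names chosen for illustration. It packs documents into constant-length samples and builds a causal mask that is reset at document boundaries, so tokens cannot attend across packed documents.

```python
def pack_documents(docs, seq_len, pad_id=0):
    """Concatenate token lists and cut them into fixed-length samples.

    Returns a list of (tokens, doc_ids) pairs; doc_ids marks which source
    document each position came from, with -1 for padding positions.
    """
    tokens, doc_ids = [], []
    for i, doc in enumerate(docs):
        tokens.extend(doc)
        doc_ids.extend([i] * len(doc))

    samples = []
    for start in range(0, len(tokens), seq_len):
        chunk = tokens[start:start + seq_len]
        ids = doc_ids[start:start + seq_len]
        pad = seq_len - len(chunk)  # only the last sample can need padding
        samples.append((chunk + [pad_id] * pad, ids + [-1] * pad))
    return samples


def block_causal_mask(doc_ids):
    """mask[q][k] is True when query q may attend to key k.

    A position attends only to earlier-or-equal positions of the same
    document; padding positions (doc id -1) attend to nothing.
    """
    n = len(doc_ids)
    return [
        [k <= q and doc_ids[q] == doc_ids[k] and doc_ids[q] != -1 for k in range(n)]
        for q in range(n)
    ]


# Example: three short documents packed into samples of length 4.
samples = pack_documents([[1, 2, 3], [4, 5], [6]], seq_len=4)
first_tokens, first_ids = samples[0]
mask = block_causal_mask(first_ids)
```

With plain packing (no reset), the mask would be an ordinary lower-triangular causal mask, so the first token of a new document could attend to the tail of the previous one. With padding, each sample would hold a single document, and the remaining positions would be wasted compute.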

[BUG] Some data is lost during transmission.

> Can you please provide some information on your test system? Especially, which type of network interconnect are you using (Ethernet, InfiniBand, etc.)? The only error I recognize is "Stream...

[BUG] Some data is lost during transmission.

If I force the sendTaggedMessage to be blocking, the example works fine. ``` // final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress() + index, messageLength, tag, true, blocking); final boolean completed = endpoint.sendTaggedMessage(sendBuffer.memoryAddress()...

[BUG] Some data is lost during transmission.

It seems that tag-matching operations are not strictly completed in order. Maybe we have to handle out-of-order completion, or use a different UCX semantic? I don't know if the...

[BUG] Some data is lost during transmission.

> I don't know how to explain this, but when I set DEFAULT_BUFFER_SLICE_LENGTH to 16K, which is the maximum size of a data frame in HTTP/2, the problem disappeared. There are...

[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`

> Thanks a lot for your comments. We will supplement these two functions soon. Does the ndtimeline support async nccl stream like grad_reduce/params_gather overlap and interleaved p2p overlap? I add...

[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id`

> > Thanks a lot for your comments. We will supplement these two functions soon. > > Does the ndtimeline support async nccl stream like grad_reduce/params_gather overlap and interleaved p2p...

[BUG] Is mcore-0.12.0 checkpoint resume correct?

> I think I may have figured this out. > > Try setting `make_vocab_size_divisible_by` to `256`, which will reduce the chance of saving a corrupted checkpoint compared to the default...

[BUG] Is mcore-0.12.0 checkpoint resume correct?

> Maybe this could help? I'm not sure 🤔 https://github.com/alibaba/Pai-Megatron-Patch/tree/main/megatron_patch/fixes/optimizer_offloading Thanks, but I did not use optimizer offloading. Will this also affect native usage? I'll give it a try in my...
