buptzyb
Thanks for the review, Changhui! Improved the coding style according to your advice.
Hi @changhuilin, what's our next step on this PR? Thank you!
@changhuilin Thank you for taking care of this PR! I've updated the description and merged the latest head into this branch.
@changhuilin So, what's our next move on this?
As of today, `--external-cuda-graph` must be used together with `--te-rng-tracker`. I suspect your phase 3 error is still strange behavior caused by OOM. Could you run some small tests first, such as running...
These are my arguments for the 8x7B CUDA graph run, but I tested with 4 nodes: `--position-embedding-type rope --normalization RMSNorm --swiglu --no-position-embedding --no-masked-softmax-fusion --tokenizer-type Llama2Tokenizer --tokenizer-model xxxxx/mixtral-tokenizer.model --ffn-hidden-size 14336 --group-query-attention --num-query-groups 8 --num-layers...
Correct, you need to pass `io_memory_reduction=True` to [make_graphed_callables](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/training.py#L630) to enable it. Your error is quite strange; I cannot think of a reason why the old and new data...
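For example, with this PR applied the call would look roughly like the sketch below. The layers and sample inputs are stand-ins, not the exact `training.py` call site; only `io_memory_reduction` is the new keyword argument, and it only exists once this PR is merged.

```python
import torch
import transformer_engine.pytorch as te

# Stand-in callables and sample inputs, just to make the call shape concrete.
# In Megatron-LM the graphed callables are the transformer layers themselves.
layers = [te.Linear(1024, 1024) for _ in range(2)]
sample_args = tuple((torch.randn(8, 1024, device="cuda"),) for _ in range(2))

graphed_layers = te.make_graphed_callables(
    tuple(layers),             # callables to capture into CUDA graphs
    sample_args,               # per-callable sample inputs used during capture
    num_warmup_iters=3,        # warmup iterations before graph capture
    io_memory_reduction=True,  # the flag discussed above, added by this PR
)
```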
Hi @Baibaifan, I tested with your configuration on my side and everything went well... Here is the memory log; CUDA graph (orange) takes about 2 GB more memory than non-CUDA-graph (green):...
What's the throughput once the MoE balance loss is low enough? If you only compare throughput at the very first steps, the numbers may not be meaningful.
I ran some tests and found that this problem is in...