Comments of Ethan He (36 results)
You need to use mcore models; local is being deprecated.
It's handled by TEnorm
Generally, the mask will be created inside Transformer Engine if `--use-mcore-models` is set.
You can use `--data-cache-path` to specify where you want to cache, and precompute it using a single node. https://github.com/NVIDIA/Megatron-LM/blob/9de386d08770d7296263a590171ace4ae45348ad/megatron/training/arguments.py#L1349-L1350
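A sketch of how this might look on the command line; the script name and the other flags shown here are placeholders for your own launch command, only `--data-cache-path` is the flag being discussed:

```shell
# Step 1 (single node): build the index cache once and write it to shared storage.
# /shared/megatron-data-cache is a hypothetical path on a filesystem all nodes can read.
python pretrain_gpt.py \
    --data-path /data/my-corpus_text_document \
    --data-cache-path /shared/megatron-data-cache \
    ... # your usual model/training args

# Step 2 (multi-node run): point every node at the same cache path so the
# precomputed indices are reused instead of being rebuilt per node.
```
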
TBH, I don't exactly remember the details. You can try removing stopgrad and comparing the performance.
(1) tokens = seq_len * consumed samples
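The relation in (1) can be sketched as a one-liner; the numbers below are illustrative, not from the original thread:

```python
def total_tokens(seq_len: int, consumed_samples: int) -> int:
    """Total tokens seen so far: seq_len * consumed samples, per (1)."""
    return seq_len * consumed_samples

# e.g. a 4096-token sequence length after 1M consumed samples
print(total_tokens(4096, 1_000_000))
```
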