Ethan He

36 comments of Ethan He

You need to use mcore models. The local implementation is being deprecated.

Generally, the mask will be created inside Transformer Engine if `--use-mcore-models` is set.
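For example, a launch script might look roughly like this (a hedged sketch: `--use-mcore-models` and `--transformer-impl` are real Megatron-LM flags, but the script name, process count, and elided arguments are illustrative):

```shell
# Sketch: use the mcore model path with the Transformer Engine backend,
# so the attention mask is built inside TE rather than passed in manually.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-mcore-models \
    --transformer-impl transformer_engine \
    ...  # remaining model/data/optimizer args omitted
```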

You can use `--data-cache-path` to specify where you want to cache, and precompute it using a single node. https://github.com/NVIDIA/Megatron-LM/blob/9de386d08770d7296263a590171ace4ae45348ad/megatron/training/arguments.py#L1349-L1350
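Concretely, the idea is to run a single-node job once so the dataset index files get built and cached, then point the multi-node job at the same path. A hedged sketch (the paths and elided arguments are placeholders, not from the repo):

```shell
# Sketch: first run on one node so the dataset indices are computed once
# and cached under a shared filesystem path; later multi-node runs reuse it.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --data-path /shared/datasets/my_corpus_text_document \
    --data-cache-path /shared/megatron_data_cache \
    ...  # remaining args omitted
```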

Tbh, I don't exactly remember the details. You can try removing the stop-gradient and comparing the performance.