Dmytro Pykhtar
jenkins
I'm not seeing this issue with our latest dev container, so the fix should already be in main. It will be included in the upcoming 24.07 release container.
> > Please check #9272
>
> I checked and it using https://github.com/NVIDIA/NeMo/blob/main/requirements/requirements_lightning.txt#L7 it downgraded my transformers to 4.40.2 which fixed this problem as it pulled in https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/__init__.py#L1456

Hi @raybellwaves...
Hi, setting the `--no-mmap-bin-files` argument resolves the issue: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/arguments.py#L2797
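For readers unfamiliar with negative flags of this kind, here is a minimal sketch of how such a switch is commonly wired with Python's `argparse`. It is an illustration only, not a copy of the Megatron-LM code at the link above: the parser, the `build_parser` helper, and the `mmap_bin_files` destination name are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with one negative flag (hypothetical, for illustration)."""
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    parser.add_argument(
        "--no-mmap-bin-files",
        action="store_false",   # passing the flag stores False
        dest="mmap_bin_files",  # assumed destination name
        help="Disable memory-mapping of .bin dataset files.",
    )
    return parser


# Without the flag, memory-mapping stays enabled (store_false defaults to True).
default_args = build_parser().parse_args([])

# With the flag, memory-mapping is switched off.
no_mmap_args = build_parser().parse_args(["--no-mmap-bin-files"])
```

In this sketch the flag only flips a boolean; what the training code does with that boolean (for example, reading `.bin` files without memory-mapping) is outside the scope of the example.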
Hi, do I understand correctly that the configs for 4 GPUs and 8 GPUs are the same? Also, which callbacks do you use to log `train_step_timing in s` and `tps`....
@Proyag thanks for the scripts. Which HF model do you use?
From the logs you shared, it looks like both the log output and `train_step_timing in s` are reported from GPU 0. I also managed to reproduce the script...