
Running container image Nemo:25.07.gpt_oss twice consecutively on the same node results in very different compatibility behaviors

Open tjoymeed opened this issue 3 months ago • 3 comments

On the same node:

First run - no problem: NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.9 driver version 575.51.03 with kernel driver version 570.133.20. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Second run - ERROR: ERROR: This container was built for NVIDIA Driver Release 575.51 or later, but version 570.133.20 was detected and compatibility mode is UNAVAILABLE.

   [[]]

What happened?


mike@lg01:~$ srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=interactive --account=mike --export=ALL --constraint="ARCH:X86" --pty bash
srun: job 1263387 queued and waiting for resources
srun: job 1263387 has been allocated resources
bash: export: __add_sys_prefix_to_path: not a function
mike@gn12:~/training/experiments$ enroot start pyxis_nemo-25.07.gpt_oss

====================
== NeMo Framework ==
====================

NVIDIA Release  (build 211400170)
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.9 driver version 575.51.03 with kernel driver version 570.133.20.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
bash: export: __add_sys_prefix_to_path: not a function
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
mike@gn12:/workspace$ exit
exit
mike@gn12:~/training/experiments$ enroot start pyxis_nemo-25.07.gpt_oss

====================
== NeMo Framework ==
====================

NVIDIA Release  (build 211400170)
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

ERROR: This container was built for NVIDIA Driver Release 575.51 or later, but
       version 570.133.20 was detected and compatibility mode is UNAVAILABLE.

       [[]]

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
bash: export: __add_sys_prefix_to_path: not a function
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
mike@gn12:/workspace$ 

tjoymeed, Sep 23 '25 04:09

Hi @tjoymeed - thank you for all the issues you have raised to NeMo. May I know more about what you are trying to do? If possible, feel free to add my wechat wenwengao2014. Thank you!

snowmanwwg, Sep 30 '25 23:09

Hi @snowmanwwg, thanks a lot for your attention. We are exploring how to optimize training throughput. We are starting by replicating the results of the performance script for Qwen3-30B-A3B ...

https://github.com/NVIDIA-NeMo/NeMo/blob/main/scripts/performance/llm/pretrain_qwen3_30b_a3b.py

Being able to replicate the results of this script is important because we will then adapt it to our own training dataset.

Happy to e-connect!

Thanks again!

tjoymeed, Oct 01 '25 02:10

This sounds related to

https://github.com/NVIDIA-NeMo/NeMo/issues/14807

Could you confirm whether, on your hardware, running twice with

--container-image=nvcr.io#nvidia/nemo:25.07.gpt_oss

also fails? If so, that may point to the infrastructure not cleaning up the container file systems it creates.

terrykong, Oct 01 '25 19:10
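
To make the "run it twice and compare" check in this thread repeatable, here is a minimal shell sketch. It only classifies the startup banner by the two phrases that appear in the logs above. The function names `classify_compat` and `run_and_classify` are hypothetical helpers, not part of NeMo, enroot, or pyxis, and the sketch assumes the banner is printed on stdout or stderr each time the container starts.

```shell
#!/usr/bin/env bash
# Sketch: classify the CUDA compatibility banner printed at container startup.
# The two matched phrases are copied from the logs in this issue.

classify_compat() {
  # Reads startup output on stdin and prints one word describing it.
  local banner
  banner="$(cat)"
  case "$banner" in
    *"CUDA Forward Compatibility mode ENABLED"*) echo "enabled" ;;
    *"compatibility mode is UNAVAILABLE"*)       echo "unavailable" ;;
    *)                                           echo "unknown" ;;
  esac
}

run_and_classify() {
  # Runs the given command, merges stderr into stdout, and classifies the output.
  "$@" 2>&1 | classify_compat
}

# Usage on the compute node (assumes the enroot container from this report
# exists and that the banner appears before the trailing command runs):
#   for attempt in 1 2; do
#     echo "run $attempt: $(run_and_classify enroot start pyxis_nemo-25.07.gpt_oss true)"
#   done
```

If the two attempts print different words, that reproduces the behavior reported here without needing an interactive shell inside the container. The same wrapper could be pointed at an `srun` command that uses the `--container-image` option mentioned above, provided the site's configuration runs the image entrypoint.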