Running container image nemo:25.07.gpt_oss twice in a row on the same node produces different CUDA compatibility behavior
On the same node:
First run - no problem: NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.9 driver version 575.51.03 with kernel driver version 570.133.20. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
Second run - ERROR: ERROR: This container was built for NVIDIA Driver Release 575.51 or later, but version 570.133.20 was detected and compatibility mode is UNAVAILABLE.
What happened?
mike@lg01:~$ srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=interactive --account=mike --export=ALL --constraint="ARCH:X86" --pty bash
srun: job 1263387 queued and waiting for resources
srun: job 1263387 has been allocated resources
bash: export: __add_sys_prefix_to_path: not a function
mike@gn12:~/training/experiments$ enroot start pyxis_nemo-25.07.gpt_oss
====================
== NeMo Framework ==
====================
NVIDIA Release (build 211400170)
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.9 driver version 575.51.03 with kernel driver version 570.133.20.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
bash: export: __add_sys_prefix_to_path: not a function
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
mike@gn12:/workspace$ exit
exit
mike@gn12:~/training/experiments$ enroot start pyxis_nemo-25.07.gpt_oss
====================
== NeMo Framework ==
====================
NVIDIA Release (build 211400170)
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
ERROR: This container was built for NVIDIA Driver Release 575.51 or later, but
version 570.133.20 was detected and compatibility mode is UNAVAILABLE.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
bash: export: __add_sys_prefix_to_path: not a function
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
Error while loading conda entry point: conda-anaconda-tos (No module named 'pydantic_core._pydantic_core')
mike@gn12:/workspace$
Hi @tjoymeed - thank you for all the issues you have raised to NeMo. May I know more about what you are trying to do? If possible, feel free to add my wechat wenwengao2014. Thank you!
Hi @snowmanwwg, thanks a lot for your attention. We are exploring ways to optimize training throughput. We are starting by replicating the results of the Qwen3-30B-A3B performance script:
https://github.com/NVIDIA-NeMo/NeMo/blob/main/scripts/performance/llm/pretrain_qwen3_30b_a3b.py
Being able to replicate the results of this script is important because we will then adapt it to our own training dataset.
Happy to connect!
Thanks again!
This sounds related to
https://github.com/NVIDIA-NeMo/NeMo/issues/14807
Could you confirm whether, on your hardware, running with
--container-image=nvcr.io#nvidia/nemo:25.07.gpt_oss
twice in a row also fails? If so, that may point to the infrastructure not cleaning up the container file systems it creates.
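To make the comparison between consecutive runs less error-prone, the startup banner can be classified automatically rather than read by eye. The following is a minimal sketch, not a verified reproduction: `compat_status` is a hypothetical helper name, and it only matches the two message strings that appear in the logs above. The usage shown in the comment assumes that passing a command to `enroot start` still prints the banner before the command runs, which has not been confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: classify a container startup banner read from stdin.
# It only looks for the two messages quoted in the logs in this issue.
compat_status() {
    banner=$(cat)
    case "$banner" in
        *"compatibility mode is UNAVAILABLE"*)
            echo UNAVAILABLE ;;
        *"CUDA Forward Compatibility mode ENABLED"*)
            echo ENABLED ;;
        *)
            echo UNKNOWN ;;
    esac
}

# Possible usage on the compute node (assumption: the banner is printed
# even when a command such as `true` is passed to the container):
#
#   for i in 1 2; do
#       enroot start pyxis_nemo-25.07.gpt_oss true 2>&1 | compat_status
#   done
```

If the two iterations print different statuses, that matches the behavior reported above and would be worth attaching to the issue along with the node name.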