
roottest running out of threads !?

pcanal opened this issue 1 year ago • 5 comments

Check duplicate issues.

  • [ ] Checked for duplicates

Description

When running with ctest -j 32 on a node with 127 cores (see below for more details), one of the runs had many failures due to running out of thread resources. The list of affected tests includes:

347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression 
349:PyMVA-Keras-Multiclass  
985:tutorial-tmva-TMVA_SOFIE_Keras
1238:tutorial-tmva-RBatchGenerator_PyTorch-py  
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py   
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py        
1252:tutorial-tmva-keras-GenerateModel-py       
1253:tutorial-tmva-keras-MulticlassKeras-py       
1584:roottest-root-io-evolution-make              
1641:roottest-root-io-newstl-make

i.e. those listed above (and possibly tutorial-tmva-keras-MulticlassKeras-py, which did not run because it requires the previous test).

Reproducer

347/2278 Testing: PyMVA-Keras-Classification
347/2278 Test: PyMVA-Keras-Classification
Command: "/usr/bin/cmake" "-DCMD=/home/pcanal/root_working/build/quick-devel/tmva/pymva/test/testPyKerasClassification" "-DSYS=/home/pcanal/root_working/build/quick-devel" "-P" "/home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake"
Directory: /home/pcanal/root_working/build/quick-devel/tmva/pymva/test
"PyMVA-Keras-Classification" start time: Sep 24 20:01 UTC
Output:
----------------------------------------------------------
Get test data...
Generate keras model...
2024-09-24 20:01:12.572604: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 20:01:12.572668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 20:01:12.573914: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-24 20:01:12.581129: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-24 20:01:15.157134: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
2024-09-24 20:01:26.401521: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
[ERROR] Failed to generate model using python
CMake Error at /home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake:232 (message):
  error code: 1


<end of output>
Test time =  54.61 sec
----------------------------------------------------------
Test Failed.
"PyMVA-Keras-Classification" end time: Sep 24 20:02 UTC
"PyMVA-Keras-Classification" time elapsed: 00:00:54

Other errors:

14323:    system_error: Resource temporarily unavailable
614356:/bin/sh: fork: retry: Resource temporarily unavailable
614357:/bin/sh: fork: retry: Resource temporarily unavailable
614358:/bin/sh: fork: retry: Resource temporarily unavailable
614359:/bin/sh: fork: retry: Resource temporarily unavailable
614360:/bin/sh: fork: Resource temporarily unavailable
614444:/bin/sh: fork: retry: Resource temporarily unavailable
614445:/bin/sh: fork: retry: Resource temporarily unavailable
614446:/bin/sh: fork: retry: Resource temporarily unavailable
614447:/bin/sh: fork: retry: Resource temporarily unavailable
616571:LLVM ERROR: pthread_create failed: Resource temporarily unavailable
616573:sh: fork: retry: Resource temporarily unavailable
616574:sh: fork: retry: Resource temporarily unavailable
616575:sh: fork: retry: Resource temporarily unavailable
616576:sh: fork: retry: Resource temporarily unavailable
616577:sh: fork: Resource temporarily unavailable
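
Both failure modes in the log report errno 11, which on Linux is EAGAIN ("Resource temporarily unavailable"): pthread_create() and fork() both return it when the caller would exceed the per-user task limit (RLIMIT_NPROC, which counts threads as well as processes) or the kernel-wide kernel.threads-max. A small diagnostic sketch, assuming the usual Linux /proc layout, that counts the current user's tasks against that limit:

# Illustrative diagnostic: count kernel tasks (threads) owned by the current
# user and compare against RLIMIT_NPROC, the limit that makes pthread_create()
# and fork() fail with EAGAIN (errno 11).
import os
import resource

uid = os.getuid()
tasks = 0
for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        if os.stat(f"/proc/{pid}").st_uid != uid:
            continue
        # Each entry under /proc/<pid>/task is one kernel task (thread).
        tasks += len(os.listdir(f"/proc/{pid}/task"))
    except OSError:
        continue  # process exited while scanning
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"tasks owned by uid {uid}: {tasks}")
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")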

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

The node is a VM with 128 GB of RAM and is accessed via a Jupyter notebook.

jupyter-pcanal-rootdevel:quick-devel pcanal$ uname -a
Linux jupyter-pcanal-rootdevel 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul  6 04:05:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
CPU(s):                  127
  On-line CPU(s) list:   0-126
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7543 32-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  1

pcanal, Sep 27 '24 22:09

Thanks for this report. These errors refer to fork: are we sure the resource we are lacking is threads and not PIDs? Is the configuration of the machine "sane", i.e. does it allow an adequate number of subprocesses per process?
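
(Not part of the original comment: the limits in question can also be read programmatically. A purely illustrative sketch, assuming Linux, equivalent to the cat/ulimit output in the reply below:)

# Illustrative: print the kernel-wide thread/PID limits and the per-user
# process limit that bound fork() and pthread_create().
import resource

for path in ("/proc/sys/kernel/threads-max",
             "/proc/sys/kernel/pid_max",
             "/proc/sys/vm/max_map_count"):
    with open(path) as f:
        print(path, "=", f.read().strip())

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("max user processes (RLIMIT_NPROC) =", soft, hard)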

dpiparo, Sep 28 '24 19:09

It looks okay:

$ cat /proc/sys/kernel/threads-max
7897651
$ cat /proc/sys/kernel/pid_max 
4194304
$ cat /proc/sys/vm/max_map_count
262144
jupyter-pcanal-rootdevel:quick-devel pcanal$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 3948825
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 4194304
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

pcanal, Sep 30 '24 13:09

Ok, I think we have at least two problems here. The first is related to these errors: "Unable to register cuDNN/cuFFT/cuBLAS factory: Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS when one has already been registered". For those, I propose you set up your machine following the hints in this thread: https://github.com/tensorflow/tensorflow/issues/62075 (it's a TensorFlow bug).

As for fork: retry: Resource temporarily unavailable, again it looks like a configuration matter specific to the node. Some research turns up pages like https://unix.stackexchange.com/questions/205016/fork-retry-resource-temporarily-unavailable, which hint at settings like the ones in /etc/sysctl.conf.

All in all, I am inclined to consider this item related to the platform at hand and not to ROOT.

dpiparo, Oct 01 '24 06:10

Just trying to understand whether more information is available about this item. I would like to find out whether this is an issue with ROOT(test) or with the configuration of the machine...

dpiparo, Oct 07 '24 06:10

Hi @pcanal, can you check if the situation is better with https://github.com/root-project/root/pull/16717 merged?

guitargeek, Oct 19 '24 10:10

Yes, it looks better (I can no longer reproduce those failures).

pcanal, Nov 14 '24 20:11