[tmva] Missing dependency or clean up in TMVA test/tutorials

Open pcanal opened this issue 1 year ago • 7 comments

Check duplicate issues.

  • [ ] Checked for duplicates

Description

On a large node (127 cores, 128 GB), I ran:

  1. ctest -j 32
  2. ctest --rerun-failed
  3. ctest -j 32

After step 1, many tests failed due to lack of resources (running out of threads, see #16552):

47:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression
349:PyMVA-Keras-Multiclass
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
985:tutorial-tmva-TMVA_SOFIE_Keras
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1238:tutorial-tmva-RBatchGenerator_PyTorch-py
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py
1246:tutorial-tmva-TMVA_SOFIE_Models-py
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
1252:tutorial-tmva-keras-GenerateModel-py
1253:tutorial-tmva-keras-MulticlassKeras-py

However, in step 2 several tests still failed (even though resources were no longer an issue):

350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py

The errors listed there included:

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!
tutorials/tmva/TMVA_SOFIE_RDataFrame.C:29:10: fatal error: 'Higgs_trained_model.hxx' file not found
/tutorials/tmva/TMVA_SOFIE_GNN_Application.C:10:10: fatal error: 'encoder.hxx' file not found

From this I conclude that those tests (in particular TMVA_SOFIE_RDataFrame.C and tutorials/tmva/TMVA_SOFIE_GNN_Application.C) are missing a dependency on a test that failed in the first run.
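The comparison behind this conclusion can be sketched in a few lines of Python. This is only an illustration: the two strings are excerpts of the lists above (not the full lists), and the helper names are made up for this sketch. CTest normally writes the same `NUMBER:NAME` format to Testing/Temporary/LastTestsFailed.log in the build directory, but treat that path as an assumption about the local setup.

```python
def parse_failed(text):
    """Map test name -> test number from 'NUMBER:NAME' lines."""
    failed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        number, _, name = line.partition(":")
        failed[name] = int(number)
    return failed


def still_failing(first, second):
    """Names of tests that failed in both runs."""
    return sorted(set(parse_failed(first)) & set(parse_failed(second)))


def recovered(first, second):
    """Names of tests that failed in the first run but not in the second."""
    return sorted(set(parse_failed(first)) - set(parse_failed(second)))


# Excerpts of the two lists above, for illustration only.
first_run = """\
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
985:tutorial-tmva-TMVA_SOFIE_Keras
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
"""
rerun = """\
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
"""

print("still failing:", still_failing(first_run, rerun))
print("recovered:", recovered(first_run, rerun))
```

Tests that show up in the "still failing" set after resources are no longer a constraint are the candidates for a missing dependency.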

Note that tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel and tutorial-tmva-TMVA_SOFIE_RDataFrame-py do need TMVA_Higgs_Classification.C to run first (it says so in the output! :) ).
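To make the ordering constraint concrete, here is a small Python sketch that derives a valid run order from a "must run first" map. The test name `tutorial-tmva-TMVA_Higgs_Classification` is an assumption (the output only names the macro TMVA_Higgs_Classification.C), and this is not how ROOT's CMake files express the dependency; it only illustrates the constraint a test runner would have to honour.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical test name for the macro TMVA_Higgs_Classification.C.
PRODUCER = "tutorial-tmva-TMVA_Higgs_Classification"

# test -> set of tests that must have run before it
requires = {
    "tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel": {PRODUCER},
    "tutorial-tmva-TMVA_SOFIE_RDataFrame-py": {PRODUCER},
}


def run_order(requires):
    """Return one ordering in which every test follows its prerequisites."""
    return list(TopologicalSorter(requires).static_order())


order = run_order(requires)
print(order)
```

If the producer is not declared as a prerequisite, a parallel run is free to schedule the dependent tests first, which matches the failures seen here.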

tutorial-tmva-TMVA_SOFIE_RSofieReader is asking for Higgs_trained_model.h5

gtest-tmva-pymva-test-TestRModelParserKeras is missing the symbol sgemm_ (see below)

However, when running everything again (step 3, where this time there were somehow no resource-related failures), I still got several failures:

346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader

all due to:

IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

or both

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

Could this be due either to a badly formed result left over from the failing run (1) or to an external package that does not have the correct version number?

Reproducer

ctest -j 32 # and get lots of out of resource failures
ctest --rerun-failed
ctest -j 32

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

jupyter-pcanal-rootdevel:quick-devel pcanal$ bin/root-config --features
cxx17 asimage builtin_clang builtin_cling builtin_gtest builtin_llvm builtin_lz4 builtin_lzma builtin_nlohmannjson builtin_openui5 builtin_tbb builtin_vdt builtin_xxhash builtin_zlib builtin_zstd clad dataframe davix gdml http imt pyroot roofit root7 rpath runtime_cxxmodules shared sqlite ssl tmva tmva-pymva tpython spectrum vdt x11 xml xrootd

pcanal avatar Sep 27 '24 23:09 pcanal

Hi @pcanal , thanks for this report. Hopefully the solution will help also with fewer threads. I am not sure though that the unresolved while linking is due to the high thread count. Can you confirm that you do not see these errors with 8-16 threads?

dpiparo avatar Sep 28 '24 19:09 dpiparo

> I am not sure though that the unresolved while linking is due to the high thread count.

I think you might be right. The best way forward is to track down where those missing symbols are supposed to come from.
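One way to track down where a symbol such as sgemm_ could come from is to probe candidate shared libraries for it. A minimal Python sketch, assuming the candidates can be located under the names "blas" and "openblas" (those names are guesses for illustration, and the helper is hypothetical):

```python
import ctypes
import ctypes.util


def library_exports(library_name, symbol):
    """Return True/False if the library loads and does/does not export
    `symbol`; return None if the library cannot be found or loaded."""
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None
    try:
        library = ctypes.CDLL(path)
    except OSError:
        return None
    # ctypes raises AttributeError for unknown symbols, so hasattr works here.
    return hasattr(library, symbol)


# Candidate library names are assumptions, not taken from the build.
candidates = ["blas", "openblas"]
report = {name: library_exports(name, "sgemm_") for name in candidates}
print(report)
```

A value of None for every candidate would suggest that no such library is visible to the dynamic loader on the machine.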

pcanal avatar Sep 30 '24 13:09 pcanal

Thanks for the comment. At this point this issue seems to conflate two things:

  1. The dependencies of python tests. This should have been addressed by #16555
  2. The missing symbols.

If 1. is confirmed to be solved, I would say that at least this issue ought to be closed and one about missing symbols opened. However, even if an issue dedicated to the missing symbols is opened, it's not clear, at least to me, how the problem can be reproduced. So far we have no indication of it in our CI: can it be due to a somewhat imprecise formulation of the python dependencies in the requirements.txt file that affects your platform?

dpiparo avatar Oct 01 '24 06:10 dpiparo

Do we have perhaps a better understanding of this issue? I understand the dependencies are now fixed. Are the symbols also cured?

dpiparo avatar Oct 07 '24 06:10 dpiparo

For the symbols, I am waiting on input about which library they are meant to come from.

pcanal avatar Oct 07 '24 15:10 pcanal

So I "found" that sgemm is explicitly meant to come from a BLAS implementation, and some tests seem to rely on it and still run (even though CMakeCache.txt knows BLAS was not found).
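For reference, a quick way to see what the cache recorded is to scan CMakeCache.txt for BLAS-related entries. The sketch below is in Python and relies only on the general `KEY:TYPE=VALUE` layout of cache files; the sample entries (including the key `BLAS_blas_LIBRARY`) are made up for illustration and may not match what this build actually contains.

```python
def blas_cache_entries(cache_text):
    """Collect KEY -> VALUE for cache entries whose key mentions BLAS."""
    entries = {}
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' / '//' comment lines CMake writes.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, separator, value = line.partition("=")
        if not separator:
            continue
        key = key_and_type.split(":", 1)[0]
        if "BLAS" in key.upper():
            entries[key] = value
    return entries


def not_found(entries):
    """Keys whose value marks a failed lookup (CMake's '-NOTFOUND' suffix)."""
    return sorted(key for key, value in entries.items() if value.endswith("NOTFOUND"))


# Hypothetical excerpt; a real check would read CMakeCache.txt from the build directory.
sample = """\
// Path to a library.
BLAS_blas_LIBRARY:FILEPATH=BLAS_blas_LIBRARY-NOTFOUND
CMAKE_BUILD_TYPE:STRING=Release
"""

entries = blas_cache_entries(sample)
print(entries)
print(not_found(entries))
```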

The following 3 tests fail consistently with missing BLAS symbols:

984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader

but strangely more tests fail with missing BLAS symbols when run in parallel:

346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras

The second part is now tracked in https://github.com/root-project/root/issues/16719

The first part is now tracked in https://github.com/root-project/root/issues/16720. Fixing #16720 will likely hide the problem described in #16719.

pcanal avatar Oct 11 '24 18:10 pcanal

See the related failures produced on the CI: https://github.com/root-project/root/pull/16664/checks?check_run_id=31435842971 where we run just the TMVA tests to increase the chance of collisions. Indeed, tutorial-tmva-TMVA_SOFIE_GNN_Application fails on most platforms with:

/github/home/ROOT-CI/src/tutorials/tmva/TMVA_SOFIE_GNN_Application.C:10:10: fatal error: 'encoder.hxx' file not found
#include "encoder.hxx"
         ^~~~~~~~~~~~~

and tutorial-tmva-TMVA_RNN_Classification-py fails (on just alma9-clang) due to timeout.


This specific issue is resolved by https://github.com/root-project/root/pull/16711

pcanal avatar Oct 12 '24 11:10 pcanal