
PyTorch test/tutorials are (likely) using the same model files.

Open pcanal opened this issue 1 year ago • 4 comments

Doing:

ctest -R tmva -j 32

will produce an arbitrary result (sometimes pass, sometimes fail) for

gtest-tmva-pymva-TestRModelParserKeras
gtest-tmva-pymva-TestRModelParserPyTorch 

Re-running just those tests (whether they succeeded or not) leads to both of them failing. The failure report indicates that they 'now' need the BLAS library (which is not available on the system).

As a possible clue (or not), the following 3 tests fail systematically on the system due to the missing BLAS library:

        996 - tutorial-tmva-TMVA_SOFIE_GNN_Application (Failed)
        1000 - tutorial-tmva-TMVA_SOFIE_RDataFrame (Failed)
        1002 - tutorial-tmva-TMVA_SOFIE_RSofieReader (Failed)

pcanal avatar Oct 19 '24 18:10 pcanal

It is confirmed that one of these files:

-rw-r--r--. 1 pcanal us_cms   8962 Oct 19 17:56 ./tmva/pymva/test/PyTorchModelSequential.pt
-rw-r--r--. 1 pcanal us_cms  11913 Oct 19 17:56 ./runtutorials/modelClassification.pt
-rw-r--r--. 1 pcanal us_cms  10564 Oct 19 17:56 ./runtutorials/PyTorchModel.pt
-rw-r--r--. 1 pcanal us_cms  10941 Oct 19 17:56 ./runtutorials/modelMultiClass.pt
-rw-r--r--. 1 pcanal us_cms  11330 Oct 19 17:56 ./runtutorials/trainedModelMultiClass.pt
-rw-r--r--. 1 pcanal us_cms  12110 Oct 19 17:56 ./runtutorials/trainedModelClassification.pt
-rw-r--r--. 1 pcanal us_cms   7853 Oct 19 17:57 ./runtutorials/modelRegression.pt
-rw-r--r--. 1 pcanal us_cms   7972 Oct 19 17:57 ./runtutorials/trainedModelRegression.pt
-rw-r--r--. 1 pcanal us_cms  11044 Oct 19 18:02 ./tmva/pymva/test/PyTorchModelModule.pt
-rw-r--r--. 1 pcanal us_cms   8337 Oct 19 18:02 ./tmva/pymva/test/PyTorchModelConvolution.pt
-rw-r--r--. 1 pcanal us_cms 684930 Oct 19 18:02 ./runtutorials/PyTorchTrainedModelCNN.pt
-rw-r--r--. 1 pcanal us_cms 684658 Oct 19 18:02 ./runtutorials/PyTorchModelCNN.pt

is making gtest-tmva-pymva-TestRModelParserPyTorch fail.
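A listing like the one above can be regenerated after each test run to see which model files are left behind and when they were written. The following is only a minimal sketch: the helper name `list_model_files` is hypothetical, and the `*.pt` pattern and the choice of the build directory as the search root are assumptions taken from the listing, not from the ROOT test suite itself.

```python
from datetime import datetime
from pathlib import Path


def list_model_files(root, pattern="*.pt"):
    """Return (mtime, size, path) tuples for matching files, oldest first."""
    entries = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            stat = path.stat()
            entries.append((stat.st_mtime, stat.st_size, path))
    # Tuples sort by modification time first, so the oldest file comes first.
    return sorted(entries)


if __name__ == "__main__":
    # Assumption: run from the top of the build directory.
    for mtime, size, path in list_model_files("."):
        stamp = datetime.fromtimestamp(mtime).strftime("%b %d %H:%M")
        print(f"{size:>8} {stamp} {path}")
```

Comparing the output before and after a single test run shows which files that test created or overwrote.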

pcanal avatar Oct 19 '24 18:10 pcanal

However, gtest-tmva-pymva-TestRModelParserKeras fails with or without those files.

pcanal avatar Oct 19 '24 18:10 pcanal

Apparently it is the test itself that is not runnable a second time :(:(

jupyter-pcanal-rootdevel:quick-devel pcanal$ ctest -R gtest-tmva-pymva-TestRModelParserPyTorch
Test project /home/pcanal/root_working/build/quick-devel
    Start 349: gtest-tmva-pymva-TestRModelParserPyTorch
1/1 Test #349: gtest-tmva-pymva-TestRModelParserPyTorch ...   Passed   15.87 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  16.13 sec
jupyter-pcanal-rootdevel:quick-devel pcanal$ ctest -R gtest-tmva-pymva-TestRModelParserPyTorch
Test project /home/pcanal/root_working/build/quick-devel
    Start 349: gtest-tmva-pymva-TestRModelParserPyTorch
1/1 Test #349: gtest-tmva-pymva-TestRModelParserPyTorch ...***Failed    9.29 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   9.55 sec

The following tests FAILED:
        349 - gtest-tmva-pymva-TestRModelParserPyTorch (Failed)
Errors while running CTest
Output from these tests are in: /home/pcanal/root_working/build/quick-devel/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

pcanal avatar Oct 19 '24 18:10 pcanal

Re-assigning to @lmoneta

dpiparo avatar Oct 19 '24 21:10 dpiparo

@pcanal , could you please re-summarise the status also given the better understanding we have of https://github.com/root-project/root/issues/16720 ?

dpiparo avatar Oct 21 '24 04:10 dpiparo

The summary is simple (and still the same after applying 38b0d88 (#16722)):

On first run in a clean directory with BLAS missing, we get:

ctest -R gtest-tmva-pymva-TestRModelParserPyTorch
Test project /home/pcanal/root_working/build/quick-devel
    Start 349: gtest-tmva-pymva-TestRModelParserPyTorch
1/1 Test #349: gtest-tmva-pymva-TestRModelParserPyTorch ...   Passed   16.11 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  16.37 sec

and if we immediately re-run we get:

ctest -R gtest-tmva-pymva-TestRModelParserPyTorch
Test project /home/pcanal/root_working/build/quick-devel
    Start 349: gtest-tmva-pymva-TestRModelParserPyTorch
1/1 Test #349: gtest-tmva-pymva-TestRModelParserPyTorch ...***Failed    9.10 sec

and the error is:

[ RUN      ] RModelParser_PyTorch.SEQUENTIAL_MODEL
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

This indicates that on the second run, the test wants symbols from the BLAS library.
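The pass-then-fail pattern can be checked mechanically by running the same command twice and comparing exit codes. This is a rough sketch, not part of the ROOT test infrastructure: the helper names `run_repeatedly` and `fails_only_on_rerun` are hypothetical, and it relies only on the fact that `ctest` exits with a non-zero status when a test fails.

```python
import subprocess


def run_repeatedly(cmd, times=2):
    """Run cmd `times` times in sequence and return the list of exit codes."""
    codes = []
    for _ in range(times):
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes.append(completed.returncode)
    return codes


def fails_only_on_rerun(codes):
    """True if the first run succeeded and at least one later run failed."""
    return bool(codes) and codes[0] == 0 and any(c != 0 for c in codes[1:])


# Example invocation (assumes a clean build directory and ctest on PATH):
# codes = run_repeatedly(["ctest", "-R", "gtest-tmva-pymva-TestRModelParserPyTorch"])
# print(codes, fails_only_on_rerun(codes))
```

A result where only the re-run fails points at state left behind by the first run rather than at the test logic itself.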

pcanal avatar Oct 21 '24 17:10 pcanal

@guitargeek @pcanal is this issue maybe fixed by https://github.com/root-project/root/pull/18257 ?

Also related: https://github.com/root-project/root/issues/16553

ferdymercury avatar Aug 01 '25 10:08 ferdymercury

This issue appears to indeed be solved.

pcanal avatar Aug 27 '25 17:08 pcanal

Hi @pcanal, @lmoneta,

It appears this issue is closed, but wasn't yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise.

Sincerely, :robot:

github-actions[bot] avatar Aug 28 '25 06:08 github-actions[bot]