This PR moves deviceHint in previous implementations to the Config constructor, and create a torch::Device type with some protections to ensure both model and input tensors are loaded to a specific GPU. The base of FRNN repo is changed to avoid declaration of global variables in CUDA codes which causes segmentation fault in run time when running with Triton Inference Server.

Tagging ExaTrkX aaS people here @xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19

Feb 05 '24 12:02 hrzhao76

📊: Physics performance monitoring for d3805488a5e383680b02cdf02fe5f514b248ece8

Full contents

physmon summary

Feb 05 '24 15:02 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 48.86%. Comparing base (92aca1c) to head (3ca9f70).

:exclamation: Current head 3ca9f70 differs from pull request most recent head d380548

Please upload reports for the commit d380548 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2925      +/-   ##
==========================================
+ Coverage   47.24%   48.86%   +1.62%     
==========================================
  Files         508      493      -15     
  Lines       30041    29058     -983     
  Branches    14586    13798     -788     
==========================================
+ Hits        14192    14200       +8     
+ Misses       5375     4962     -413     
+ Partials    10474     9896     -578

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mar 01 '24 09:03 codecov[bot]

2 problems here:

codecov is failing
ci bridge wont run

@paulgessinger can you add @hrzhao76 to the bridge user list?

Mar 01 '24 14:03 andiwand

@andiwand Thought I did already, but done now anyway.

Mar 01 '24 15:03 paulgessinger

CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

Mar 01 '24 15:03 andiwand

CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

@andiwand thanks for the message. I fix it in this commit 49a628a, then manually do the unit test and it works.

However, now the ci fails when downloading the Geant4 pkg. Seems URL changed. https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678358#L340

it also fails with the ACTS_LOG_FAILURE_THRESHOLD=WARNING. in this case would you suggest to change the Warning to Info if GPU is not available?

https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678356#L362

Mar 04 '24 08:03 hrzhao76

@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.

Mar 12 '24 15:03 paulgessinger

https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038

09:27:40    MetricLearni   WARNING   GPU device 0 not available. Using CPU instead.
FPE masks:
- Fatras/include/ActsFatras/Kernel/detail/SimulationActor.hpp:197: FLTUND: 1
- Fatras/include/ActsFatras/Physics/ElectroMagnetic/BetheHeitler.hpp:65: FLTUND: 1
- Examples/Io/Root/src/RootTrackStatesWriter.cpp:590: FLTINV: 1
- Examples/Io/Root/src/RootTrackSummaryWriter.cpp:419: FLTINV: 1
- Core/src/Vertexing/AdaptiveMultiVertexFinder.cpp:480: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:88: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:198: FLTUND: 1
Traceback (most recent call last):
  File "/builds/acts/ci-bridge/src/Examples/Scripts/Python/exatrkx.py", line 75, in <module>
    addExaTrkX(
  File "/builds/acts/ci-bridge/build/python/acts/examples/reconstruction.py", line 1530, in addExaTrkX
    graphConstructor = acts.examples.TorchMetricLearning(**metricLearningConfig)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 74, in wrapped
    fn(self, *args, **kwargs)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 40, in wrapped
    fn(self, cfg, *args, **_kwargs)
acts.ActsPythonBindings.logging.ThresholdFailure: Previous debug message exceeds the ACTS_LOG_FAILURE_THRESHOLD=WARNING configuration, bailing out.

If we want to run on a GPU there seems to be a problem with the selection. If CPU is wanted the ACTS_LOG_FAILURE_THRESHOLD seems to kill the process.

@hrzhao76

Mar 12 '24 15:03 andiwand

@hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

Jun 21 '24 06:06 paulgessinger

In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact, that the previous default device type was -1, and no it is defaulted to 0? This wouldn't make much sense to me, but is the only clear difference I see...

Jun 21 '24 08:06 benjaminhuth

@hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

Sorry for the late reply. I'm looking into this ci bridge failure.

Jun 21 '24 10:06 hrzhao76

Hello @benjaminhuth @paulgessinger @andiwand Thank you for pointing out the CI bridge failure. The issue arises because one of the examples is tested with only the CPU, and CUDA_VISIBLE_DEVICES is empty. You can see this at line 1302 in test_examples.py. The original device selection logic reports a ACTS_WARNING anyway and fails to run in such a CPU-only testing even though on a GPU equipped machine. I have improved the logic, and it should work correctly now. Let's wait for the ci bridge results.

Jun 21 '24 11:06 hrzhao76

Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?

Jun 21 '24 11:06 paulgessinger

Thanks a lot for the update! I have only a some nitpick-comments on the logging.

One idea: Would it be worth to factor out and centralize the device selection logic to a seperate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

I've corrected the logging. Regarding the deviceSelection.hpp, perhaps we can consider this for the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.

Jun 21 '24 13:06 hrzhao76

Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?

Thanks for pointing out. I've resolved this conflict after pulling the new commits. Also fixed a bug to avoid calling CUDAGuard when the device is kCPU, which failed in last round of ci bridge...

Jun 21 '24 13:06 hrzhao76

:red_circle: Athena integration test results [cf9d8721bf662d2fb040cbc6c03b54e618c567b4]

:red_circle: Some tests have failed!

Please investigate the pipeline!

status	job	report
:green_circle:	run_unit_tests
:green_circle:	test_ActsDumpGeometryIdentifiers
:green_circle:	test_ActsCheckObjectCountsWorkflow
:green_circle:	test_ActsEFTrackFit
:green_circle:	test_ActsPersistifySeeds
:green_circle:	test_ActsBenchmarkWithSpot
:green_circle:	test_ActsAnalogueClustering
:green_circle:	test_ActsWorkflowHeavyIons
:green_circle:	test_ActsWorkflowFastTracking
:green_circle:	test_ActsWorkflowCached
:green_circle:	test_ActsWorkflow
:green_circle:	test_ActsValidateAmbiguityResolution
:green_circle:	test_ActsValidateResolvedTracks
:green_circle:	test_ActsValidateTracks
:green_circle:	test_ActsValidateActsCoreSpacePoints
:green_circle:	test_ActsValidateActsSpacePoints
:green_circle:	test_ActsValidateSeeds
:green_circle:	test_ActsValidateOrthogonalSeeds
:green_circle:	test_ActsValidateClusters
:green_circle:	test_ActsPersistifyEDM
:green_circle:	test_ActsGSFRefitting
:green_circle:	test_ActsKfRefitting
:green_circle:	test_ActsExtrapolationAlgTest
:green_circle:	test_ActsITkTest
:green_circle:	run_workflow_tests_run4_mc
:green_circle:	run_workflow_tests_run2_mc
:green_circle:	run_workflow_tests_run2_data
:green_circle:	run_workflow_tests_run3_mc
:green_circle:	run_workflow_tests_run3_data
:red_circle:	run_art_test: test_data18_13TeV_1000evt
:green_circle:	run_art_test: test_ttbarPU40_reco

Jun 21 '24 17:06 acts-project-service

refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib

📊: Physics performance monitoring for d3805488a5e383680b02cdf02fe5f514b248ece8

physmon summary

Codecov Report

:red_circle: Athena integration test results [cf9d8721bf662d2fb040cbc6c03b54e618c567b4]

:red_circle: Some tests have failed!