refactor: unify GPU device selection in ExaTrkX; declare variables locally in FRNN lib
This PR moves the deviceHint of the previous implementation into the Config constructor and creates a torch::Device member with safeguards to ensure that both the model and the input tensors are loaded onto the same specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in CUDA code, which caused runtime segmentation faults when running with the Triton Inference Server. A sketch of the intended shape follows below.
Tagging ExaTrkX aaS people here @xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
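In rough outline, the change could look like the following. This is a minimal sketch with assumed names (`Config::deviceID`, `TorchModule`, `m_device`), not the exact ACTS API:

```cpp
// Minimal sketch (assumed names, not the exact ACTS API): the device index
// lives in the Config instead of a per-call deviceHint, and the constructor
// resolves a torch::Device once, guarding against unavailable GPUs.
#include <torch/torch.h>

struct Config {
  // Requested CUDA device index; negative means "run on the CPU".
  int deviceID = 0;
};

class TorchModule {
 public:
  explicit TorchModule(const Config &cfg) : m_device(torch::kCPU) {
    if (cfg.deviceID >= 0 && torch::cuda::is_available() &&
        static_cast<std::size_t>(cfg.deviceID) < torch::cuda::device_count()) {
      m_device = torch::Device(torch::kCUDA, cfg.deviceID);
    }
    // The TorchScript model and all input tensors are subsequently moved to
    // m_device, so they always end up on the same GPU (or on the CPU).
  }

 private:
  torch::Device m_device;
};
```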
📊: Physics performance monitoring for d3805488a5e383680b02cdf02fe5f514b248ece8
physmon summary
- ✅ CKF truth_smeared
- ✅ IVF truth_smeared
- ✅ AMVF truth_smeared
- ✅ Track Summary CKF truth_smeared
- ✅ Seeding truth_estimated
- ✅ CKF truth_estimated
- ✅ IVF truth_estimated
- ✅ AMVF truth_estimated
- ✅ Track Summary CKF truth_estimated
- ✅ Seeding seeded
- ✅ CKF seeded
- ✅ IVF seeded
- ✅ AMVF seeded
- ✅ AMVF (+grid seeder) seeded
- ✅ Track Summary CKF seeded
- ✅ Seeding orthogonal
- ✅ CKF orthogonal
- ✅ IVF orthogonal
- ✅ AMVF orthogonal
- ✅ Track Summary CKF orthogonal
- ✅ Ambisolver seeded
- ✅ Ambisolver orthogonal
- ✅ Seeding ttbar
- ✅ CKF ttbar
- ✅ Ambisolver
- ✅ Track Summary CKF ttbar
- ✅ AMVF ttbar
- ✅ AMVF (+grid seeder) ttbar
- ✅ Truth tracking (GSF)
- ✅ Truth tracking
- ✅ Truth tracking (GX2F)
- ✅ Particles fatras
- ✅ Particles geant4
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 48.86%. Comparing base (92aca1c) to head (3ca9f70).
:exclamation: Current head 3ca9f70 differs from pull request most recent head d380548
Please upload reports for the commit d380548 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## main #2925 +/- ##
==========================================
+ Coverage 47.24% 48.86% +1.62%
==========================================
Files 508 493 -15
Lines 30041 29058 -983
Branches 14586 13798 -788
==========================================
+ Hits 14192 14200 +8
+ Misses 5375 4962 -413
+ Partials 10474 9896 -578
:umbrella: View full report in Codecov by Sentry.
Two problems here:
- Codecov is failing
- the CI bridge won't run
@paulgessinger can you add @hrzhao76 to the bridge user list?
@andiwand Thought I did already, but done now anyway.
CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76
@andiwand thanks for the message. I fixed it in commit 49a628a, then ran the unit test manually and it works.
However, the CI now fails when downloading the Geant4 package; it seems the URL changed. https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678358#L340
It also fails with ACTS_LOG_FAILURE_THRESHOLD=WARNING. In this case, would you suggest changing the WARNING to INFO if no GPU is available?
https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678356#L362
@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.
https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038
09:27:40 MetricLearni WARNING GPU device 0 not available. Using CPU instead.
FPE masks:
- Fatras/include/ActsFatras/Kernel/detail/SimulationActor.hpp:197: FLTUND: 1
- Fatras/include/ActsFatras/Physics/ElectroMagnetic/BetheHeitler.hpp:65: FLTUND: 1
- Examples/Io/Root/src/RootTrackStatesWriter.cpp:590: FLTINV: 1
- Examples/Io/Root/src/RootTrackSummaryWriter.cpp:419: FLTINV: 1
- Core/src/Vertexing/AdaptiveMultiVertexFinder.cpp:480: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:88: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:198: FLTUND: 1
Traceback (most recent call last):
File "/builds/acts/ci-bridge/src/Examples/Scripts/Python/exatrkx.py", line 75, in <module>
addExaTrkX(
File "/builds/acts/ci-bridge/build/python/acts/examples/reconstruction.py", line 1530, in addExaTrkX
graphConstructor = acts.examples.TorchMetricLearning(**metricLearningConfig)
File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 74, in wrapped
fn(self, *args, **kwargs)
File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 40, in wrapped
fn(self, cfg, *args, **_kwargs)
acts.ActsPythonBindings.logging.ThresholdFailure: Previous debug message exceeds the ACTS_LOG_FAILURE_THRESHOLD=WARNING configuration, bailing out.
If we want to run on a GPU, there seems to be a problem with the selection. If the CPU is wanted, the ACTS_LOG_FAILURE_THRESHOLD=WARNING setting kills the process.
@hrzhao76
@hrzhao76 are we moving forward with this? Otherwise I'll close this PR.
In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact that the previous default device type was -1, and now it defaults to 0?
This wouldn't make much sense to me, but it is the only clear difference I see...
> @hrzhao76 are we moving forward with this? Otherwise I'll close this PR.
Sorry for the late reply. I'm looking into this ci bridge failure.
Hello @benjaminhuth @paulgessinger @andiwand Thank you for pointing out the CI bridge failure. The issue arises because one of the examples is tested with the CPU only, with CUDA_VISIBLE_DEVICES empty; you can see this at line 1302 in test_examples.py. The original device selection logic reported an ACTS_WARNING regardless and failed in such a CPU-only test, even on a GPU-equipped machine. I have improved the logic (sketched below), and it should work correctly now. Let's wait for the CI bridge results.
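For reference, a minimal sketch of what such logic could look like (assumed function name and structure, not the actual commit): with CUDA_VISIBLE_DEVICES empty, torch::cuda::is_available() returns false, so a CPU-only test falls back quietly instead of tripping the WARNING threshold.

```cpp
// Sketch only (assumed shape, not the actual patch).
#include <torch/torch.h>

torch::Device selectDevice(int deviceID) {
  // No GPU requested, or none visible (e.g. CUDA_VISIBLE_DEVICES is empty):
  // fall back to the CPU without emitting a WARNING.
  if (deviceID < 0 || !torch::cuda::is_available()) {
    return torch::Device(torch::kCPU);
  }
  // A GPU was requested but the index is out of range: this is the case
  // that genuinely deserves a WARNING before falling back.
  if (static_cast<std::size_t>(deviceID) >= torch::cuda::device_count()) {
    return torch::Device(torch::kCPU);
  }
  return torch::Device(torch::kCUDA, deviceID);
}
```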
Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?
Thanks a lot for the update! I have only some nitpick comments on the logging.
One idea: would it be worth factoring out and centralizing the device selection logic into a separate function, maybe in
.../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?
I've corrected the logging. Regarding deviceSelection.hpp, perhaps we can consider this in the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.
> Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?
Thanks for pointing that out. I've resolved the conflict after pulling the new commits. I also fixed a bug to avoid calling CUDAGuard when the device is kCPU, which caused the failure in the last round of the CI bridge...
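For illustration, the guard pattern might look like this; a sketch assuming c10::cuda::CUDAGuard from a CUDA-enabled libtorch build, with a hypothetical function name:

```cpp
// Sketch (hypothetical function): CUDAGuard aborts when handed a kCPU
// device, so it is only constructed when the device is actually CUDA.
#include <c10/cuda/CUDAGuard.h>
#include <torch/torch.h>

#include <optional>

void runOnDevice(const torch::Device &device) {
  std::optional<c10::cuda::CUDAGuard> guard;
  if (device.is_cuda()) {
    guard.emplace(device.index());  // pins the current CUDA device in scope
  }
  // ... inference runs here on `device`, with or without the guard ...
}
```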
:red_circle: Athena integration test results [cf9d8721bf662d2fb040cbc6c03b54e618c567b4]
:red_circle: Some tests have failed!
Please investigate the pipeline!