acts icon indicating copy to clipboard operation
acts copied to clipboard

refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib

Open hrzhao76 opened this issue 1 year ago • 8 comments

This PR moves deviceHint in previous implementations to the Config constructor, and create a torch::Device type with some protections to ensure both model and input tensors are loaded to a specific GPU. The base of FRNN repo is changed to avoid declaration of global variables in CUDA codes which causes segmentation fault in run time when running with Triton Inference Server.

Tagging ExaTrkX aaS people here @xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19

hrzhao76 avatar Feb 05 '24 12:02 hrzhao76

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 48.86%. Comparing base (92aca1c) to head (3ca9f70).

:exclamation: Current head 3ca9f70 differs from pull request most recent head d380548

Please upload reports for the commit d380548 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2925      +/-   ##
==========================================
+ Coverage   47.24%   48.86%   +1.62%     
==========================================
  Files         508      493      -15     
  Lines       30041    29058     -983     
  Branches    14586    13798     -788     
==========================================
+ Hits        14192    14200       +8     
+ Misses       5375     4962     -413     
+ Partials    10474     9896     -578     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 01 '24 09:03 codecov[bot]

2 problems here:

  • codecov is failing
  • ci bridge wont run

@paulgessinger can you add @hrzhao76 to the bridge user list?

andiwand avatar Mar 01 '24 14:03 andiwand

@andiwand Thought I did already, but done now anyway.

paulgessinger avatar Mar 01 '24 15:03 paulgessinger

CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

andiwand avatar Mar 01 '24 15:03 andiwand

CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

@andiwand thanks for the message. I fix it in this commit 49a628a, then manually do the unit test and it works.

However, now the ci fails when downloading the Geant4 pkg. Seems URL changed. https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678358#L340

it also fails with the ACTS_LOG_FAILURE_THRESHOLD=WARNING. in this case would you suggest to change the Warning to Info if GPU is not available?

https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678356#L362

hrzhao76 avatar Mar 04 '24 08:03 hrzhao76

@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.

paulgessinger avatar Mar 12 '24 15:03 paulgessinger

https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038

09:27:40    MetricLearni   WARNING   GPU device 0 not available. Using CPU instead.
FPE masks:
- Fatras/include/ActsFatras/Kernel/detail/SimulationActor.hpp:197: FLTUND: 1
- Fatras/include/ActsFatras/Physics/ElectroMagnetic/BetheHeitler.hpp:65: FLTUND: 1
- Examples/Io/Root/src/RootTrackStatesWriter.cpp:590: FLTINV: 1
- Examples/Io/Root/src/RootTrackSummaryWriter.cpp:419: FLTINV: 1
- Core/src/Vertexing/AdaptiveMultiVertexFinder.cpp:480: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:88: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:198: FLTUND: 1
Traceback (most recent call last):
  File "/builds/acts/ci-bridge/src/Examples/Scripts/Python/exatrkx.py", line 75, in <module>
    addExaTrkX(
  File "/builds/acts/ci-bridge/build/python/acts/examples/reconstruction.py", line 1530, in addExaTrkX
    graphConstructor = acts.examples.TorchMetricLearning(**metricLearningConfig)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 74, in wrapped
    fn(self, *args, **kwargs)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 40, in wrapped
    fn(self, cfg, *args, **_kwargs)
acts.ActsPythonBindings.logging.ThresholdFailure: Previous debug message exceeds the ACTS_LOG_FAILURE_THRESHOLD=WARNING configuration, bailing out.

If we want to run on a GPU there seems to be a problem with the selection. If CPU is wanted the ACTS_LOG_FAILURE_THRESHOLD seems to kill the process.

@hrzhao76

andiwand avatar Mar 12 '24 15:03 andiwand

@hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

paulgessinger avatar Jun 21 '24 06:06 paulgessinger

In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact, that the previous default device type was -1, and no it is defaulted to 0? This wouldn't make much sense to me, but is the only clear difference I see...

benjaminhuth avatar Jun 21 '24 08:06 benjaminhuth

@hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

Sorry for the late reply. I'm looking into this ci bridge failure.

hrzhao76 avatar Jun 21 '24 10:06 hrzhao76

Hello @benjaminhuth @paulgessinger @andiwand Thank you for pointing out the CI bridge failure. The issue arises because one of the examples is tested with only the CPU, and CUDA_VISIBLE_DEVICES is empty. You can see this at line 1302 in test_examples.py. The original device selection logic reports a ACTS_WARNING anyway and fails to run in such a CPU-only testing even though on a GPU equipped machine. I have improved the logic, and it should work correctly now. Let's wait for the ci bridge results.

hrzhao76 avatar Jun 21 '24 11:06 hrzhao76

Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?

paulgessinger avatar Jun 21 '24 11:06 paulgessinger

Thanks a lot for the update! I have only a some nitpick-comments on the logging.

One idea: Would it be worth to factor out and centralize the device selection logic to a seperate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

I've corrected the logging. Regarding the deviceSelection.hpp, perhaps we can consider this for the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.

hrzhao76 avatar Jun 21 '24 13:06 hrzhao76

Thanks @hrzhao76! There's also still conflict in thirdparty/FRNN/CMakeLists.txt, could you resolve that?

Thanks for pointing out. I've resolved this conflict after pulling the new commits. Also fixed a bug to avoid calling CUDAGuard when the device is kCPU, which failed in last round of ci bridge...

hrzhao76 avatar Jun 21 '24 13:06 hrzhao76

:red_circle: Athena integration test results [cf9d8721bf662d2fb040cbc6c03b54e618c567b4]

:red_circle: Some tests have failed!

Please investigate the pipeline!

status job report
:green_circle: run_unit_tests
:green_circle: test_ActsDumpGeometryIdentifiers
:green_circle: test_ActsCheckObjectCountsWorkflow
:green_circle: test_ActsEFTrackFit
:green_circle: test_ActsPersistifySeeds
:green_circle: test_ActsBenchmarkWithSpot
:green_circle: test_ActsAnalogueClustering
:green_circle: test_ActsWorkflowHeavyIons
:green_circle: test_ActsWorkflowFastTracking
:green_circle: test_ActsWorkflowCached
:green_circle: test_ActsWorkflow
:green_circle: test_ActsValidateAmbiguityResolution
:green_circle: test_ActsValidateResolvedTracks
:green_circle: test_ActsValidateTracks
:green_circle: test_ActsValidateActsCoreSpacePoints
:green_circle: test_ActsValidateActsSpacePoints
:green_circle: test_ActsValidateSeeds
:green_circle: test_ActsValidateOrthogonalSeeds
:green_circle: test_ActsValidateClusters
:green_circle: test_ActsPersistifyEDM
:green_circle: test_ActsGSFRefitting
:green_circle: test_ActsKfRefitting
:green_circle: test_ActsExtrapolationAlgTest
:green_circle: test_ActsITkTest
:green_circle: run_workflow_tests_run4_mc
:green_circle: run_workflow_tests_run2_mc
:green_circle: run_workflow_tests_run2_data
:green_circle: run_workflow_tests_run3_mc
:green_circle: run_workflow_tests_run3_data
:red_circle: run_art_test: test_data18_13TeV_1000evt
:green_circle: run_art_test: test_ttbarPU40_reco

acts-project-service avatar Jun 21 '24 17:06 acts-project-service