
Create internal thread to dump proxy information at given interval.

Open thananon opened this issue 10 months ago • 0 comments

Details

This PR creates a helper thread that dumps proxy operations at a fixed interval, which is useful for debugging hangs in multi-node scenarios. The thread is not created by default; it is controlled by the parameter RCCL_PROXY_MONITOR_INTERVAL. If the value is greater than 0, the thread attempts to dump proxy info at that interval (in seconds).

This should have minimal or no impact on the default mode of operation, since the thread is only started when the parameter is set.


What were the changes?
Added an optional helper thread that dumps proxy operations at a configurable interval.

Why were the changes made?
To give developers an additional debugging tool, particularly for hangs in multi-node runs.

How was the outcome achieved?
By creating a helper thread that dumps proxy information at the configured interval.

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

thananon avatar Feb 06 '25 19:02 thananon