Can it profile the Gloo backend in distributed CPU settings?
The example runs on the NCCL backend in a distributed GPU setting. I'm wondering whether it can also profile correctly in a multi-node distributed CPU setting (multiple CPU servers) with the Gloo backend.
I tried modifying the example code by changing the backend from NCCL to Gloo and setting the device to CPU. It generated reports for the distributed view and the memory view, which I think is correct, but that was only on a single machine.
I'm curious whether multi-machine settings are supported. Thanks!
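For reference, the change described above (Gloo instead of NCCL, CPU instead of GPU) might look roughly like the sketch below. It is not the original example code: the toy model, the log directory, and the helper names `rendezvous_env`, `run_worker`, and `launch_local` are hypothetical. It assumes every process runs its own profiler and writes its own trace file, since `torch.profiler` only records the process it runs in.

```python
import os


def rendezvous_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Build the environment variables read by the env:// init method (hypothetical helper)."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def run_worker(rank, world_size, log_dir="./log/gloo_cpu"):
    # PyTorch is imported here so the helper above stays usable without it.
    import torch
    import torch.distributed as dist
    from torch.profiler import (
        ProfilerActivity,
        profile,
        schedule,
        tensorboard_trace_handler,
    )

    os.environ.update(rendezvous_env(rank, world_size))
    # Gloo backend instead of NCCL; all tensors stay on the CPU.
    dist.init_process_group(backend="gloo", init_method="env://")

    # Toy model (an assumption); on CPU, DDP takes no device_ids.
    model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(32, 8))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    with profile(
        activities=[ProfilerActivity.CPU],
        schedule=schedule(wait=1, warmup=1, active=3),
        # One trace file per rank, distinguished by worker_name.
        on_trace_ready=tensorboard_trace_handler(log_dir, worker_name=f"worker{rank}"),
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        for _ in range(5):
            optimizer.zero_grad()
            loss = model(torch.randn(16, 32)).sum()
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiler schedule once per training step

    dist.destroy_process_group()


def launch_local(world_size=2):
    # Single-machine launch: spawn passes the process index as the rank.
    import torch.multiprocessing as mp

    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

Calling `launch_local()` covers the single-machine case. On multiple machines, each host would presumably start `run_worker` itself, with `MASTER_ADDR` pointing at the rank 0 host and a distinct `RANK` per process, and the per-rank trace files would then need to be collected into one log directory before viewing. I have not verified that part.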
Also, it might be worth mentioning that when I profile on the CPU, the first step generates a warning:
[W CPUAllocator.cpp:305] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
The subsequent steps don't generate the warning.
What issues did you run into when profiling a multi-CPU host?
That was so long ago that I can't remember the details. I think the problem was that the report only showed profiling results for the host code that calls the function on the other CPU nodes, so the runtime and memory in the report were very small. (That's if I recall correctly; I'm not sure about it.)