
Distributed view empty and no communication shown

Open aamijar opened this issue 2 years ago • 4 comments

Hi, I am using the sample script resnet50_ddp_profiler.py from this repository: https://github.com/pytorch/kineto/blob/main/tb_plugin/examples/resnet50_ddp_profiler.py

Using

Python3.8
torch=2.0.1
torch-tb-profiler=0.4.3 # built from source

In TensorBoard, the overview view reports communication as 0. In the distributed view:

  • no bar charts are shown for Synchronizing/Communication Overview.
  • the table at the bottom, called Communication Operation stats, has 0 values in the columns total latency, avg latency, data transfer time, and avg data transfer time.

When I try using

Python3.8
torch=1.11.0
torch-tb-profiler=0.4.3 # built from source

There are no issues and the views show up properly.

However, even with torch=1.12 and later, the communication data and the distributed view do not show up properly.

Does anyone have any insight into why this may be the case?

aamijar avatar Jul 15 '23 00:07 aamijar

I'm looking at the .json logs for both of these runs.

One observation concerns the objects in the .json named "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)".

In the .json generated by torch=2.0.1, their External id and correlation fields have the same value,

whereas in the one generated by torch=1.11.0, the External id and correlation fields have different values.

In torch=1.11.0 the External id also matches that of various other .json objects, whose names can be cudaEventRecord, cudaLaunchKernel, etc.

This is not the case in the .json generated by torch=2.0.1.
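
For anyone who wants to repeat this comparison on their own traces, here is a rough sketch of how it could be scripted. It assumes the .json is in Chrome trace layout with a top-level "traceEvents" list and that "External id" and "correlation" sit under each event's "args"; the function names and the file path in the usage comment are hypothetical, not part of kineto or tb_plugin.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load trace events from a profiler .json file.

    Assumption: Chrome-trace layout, i.e. either a dict with a
    "traceEvents" list or a bare list of event objects.
    """
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_nccl_ids(events, name_substring="ncclKernel"):
    """Count how matching events relate External id to correlation.

    Returns a dict with:
      same      - events where External id == correlation
      different - events where the two fields differ
      linked    - events whose External id also appears on an
                  event with a different name (e.g. cudaLaunchKernel)
    """
    # Index every event name by its External id.
    names_by_ext = defaultdict(set)
    for ev in events:
        ext = (ev.get("args") or {}).get("External id")
        if ext is not None:
            names_by_ext[ext].add(ev.get("name"))

    summary = {"same": 0, "different": 0, "linked": 0}
    for ev in events:
        name = ev.get("name") or ""
        if name_substring not in name:
            continue
        args = ev.get("args") or {}
        ext, corr = args.get("External id"), args.get("correlation")
        if ext is None or corr is None:
            continue
        summary["same" if ext == corr else "different"] += 1
        if names_by_ext[ext] - {name}:
            summary["linked"] += 1
    return summary


# Hypothetical usage (the path is a placeholder):
# print(summarize_nccl_ids(load_events("worker0.pt.trace.json")))
```

If the observation above holds, a torch=2.0.1 trace should report mostly "same" with few or no "linked" events, and a torch=1.11.0 trace mostly "different" and "linked".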

aamijar avatar Jul 15 '23 01:07 aamijar

@aaronenyeshi Do you know of any way to resolve this, and are you able to replicate the results above?

aamijar avatar Jul 18 '23 20:07 aamijar

Unfortunately, we are lacking resources to fix tb_plugin bugs. Plans for it are still pending.

However, the OSS community is free to submit fixes for these issues via GitHub PRs.

aaronenyeshi avatar Jul 18 '23 20:07 aaronenyeshi

Any plan on this?

npuichigo avatar Aug 24 '23 06:08 npuichigo