
Incorrect JSON format during PyTorch Execution Trace generation

Open arjuntemura opened this issue 1 year ago • 4 comments

Bug Description

I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each with 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating the PyTorch ET trace (in JSON format) using ExecutionTraceObserver() as described in the instructions. The resulting trace contains many syntax errors, and many nodes have incomplete data (images attached).
I tried this with the latest PyTorch version (2.5.0) as well but encountered the same problem.
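For reference, registration of the observer typically looks like the sketch below. This is an assumption about how the trace was collected, not the exact code used; the output path et_trace.json is just a placeholder.

    from torch.profiler import ExecutionTraceObserver

    # Register the observer with an output file, capture a region, then finalize.
    et = ExecutionTraceObserver()
    et.register_callback("et_trace.json")  # placeholder path; a single shared file here

    et.start()
    # ... run the training iterations to capture ...
    et.stop()

    et.unregister_callback()  # flushes and closes the JSON file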

Steps to Reproduce

Code used for distributed training: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series

Command to run on each node (with --node_rank=1 on the second node):

    torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv-backend=c10d --rdzv_endpoint=<ip:port> <code>.py <no. of epochs> <epochs after which result will be saved>

I am capturing the ET trace for one epoch.

Information for one GPU node (both nodes have the same configuration):

PyTorch: 2.1.2, 2.5.1 (tried both)
OS: Linux, kernel 5.15.0-124-generic, Ubuntu 22.04.5
No. of CPUs: 64
CPU architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU address sizes: 52 bits physical, 57 bits virtual
CPU byte order: Little Endian
Memory: 503 GiB
No. of GPUs: 2
GPU memory (each GPU): 95830 MiB

I would be obliged if someone could help in this regard.

Screenshots

Attached: incomplete_output1–3 (incomplete node data) and syntactical_error1–4 (JSON syntax errors).

arjuntemura · Nov 08 '24

Hey, try collecting smaller traces from PyTorch's execution trace observer. Try collecting a single iteration and you should get valid JSON.

This is a pretty annoying quirk of the execution trace observer serializer.
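For example, you can scope the capture to exactly one iteration; a sketch, where et, dataloader, capture_iter, and train_step are placeholders:

    # Start/stop the observer around a single iteration only.
    for i, batch in enumerate(dataloader):      # placeholder loader
        if i == capture_iter:                   # iteration chosen for capture
            et.start()
        train_step(batch)                       # placeholder training step
        if i == capture_iter:
            et.stop()
            et.unregister_callback()            # finalize the JSON
            break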

wkaisertexas · Nov 18 '24

I did try this for a single epoch though.

arjuntemura · Nov 26 '24


Hello, have you solved this problem now?

9LLPPLL6 · Apr 22 '25

I did. I generated separate execution traces for each GPU. It seems there was a race condition between the two GPU processes writing to the same execution trace file.
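For anyone else hitting this, the change amounts to giving each rank its own output path; a sketch (the naming pattern is just an example):

    import torch.distributed as dist
    from torch.profiler import ExecutionTraceObserver

    # One output file per rank, so the two GPU processes never share a writer.
    rank = dist.get_rank()
    et = ExecutionTraceObserver()
    et.register_callback(f"et_trace_rank{rank}.json")  # example naming scheme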

arjuntemura · Apr 22 '25