Incorrect JSON format during PyTorch Execution Trace generation
Bug Description
I am running a distributed linear model (20 parameters) across 2 GPU nodes, each node having 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating the PyTorch ET trace (in JSON format) using ExecutionTraceObserver() as described in the instructions. I observe that the trace contains many syntactic errors, and many nodes have incomplete data (images attached).
I tried this with the latest PyTorch version (2.5.0) as well but encountered the same problem.
Steps to Reproduce
Code used for distributed training: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
Command to run across both nodes:
```
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=<ip:port> <code>.py <no. of epochs> <epochs after which result will be saved>
```
I am capturing the ET trace for one epoch.
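For reference, a minimal sketch of how the trace is being collected. Only the `ExecutionTraceObserver` calls are the actual `torch.profiler` API; the output path and the training-loop names (`dataloader`, `model`, `optimizer`, `loss_fn`) are placeholders standing in for the DDP tutorial code:

```python
from torch.profiler import ExecutionTraceObserver

# Register the observer with an output file, then bracket the epoch.
et = ExecutionTraceObserver()
et.register_callback("pytorch_et.json")  # output path is a placeholder

et.start()
for inputs, targets in dataloader:  # loop names are placeholders
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
et.stop()

et.unregister_callback()  # flushes and finalizes the JSON file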
Information for one GPU node (both nodes have the same configuration):
- PyTorch: 2.1.2, 2.5.1 (tried both)
- OS: Ubuntu 22.04.5, Linux kernel 5.15.0-124-generic
- No. of CPUs: 64
- CPU architecture: x86_64
- CPU op-mode(s): 32-bit, 64-bit
- CPU address sizes: 52 bits physical, 57 bits virtual
- CPU byte order: Little Endian
- Memory: 503Gi
- No. of GPUs: 2
- GPU memory (each GPU): 95830MiB
I would be obliged if someone could help in this regard.
Screenshots
Hey, try collecting smaller traces from PyTorch's execution trace observer. Try collecting one iteration and you'll get valid JSON.
This is a pretty annoying quirk of the execution trace observer serializer.
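To illustrate the suggestion, a sketch that scopes the observer to a single iteration; the warm-up step count and file name here are illustrative, not from the original report:

```python
from torch.profiler import ExecutionTraceObserver

et = ExecutionTraceObserver()
et.register_callback("pytorch_et_one_iter.json")  # placeholder path

for step, (inputs, targets) in enumerate(dataloader):
    if step == 5:  # skip a few warm-up steps, then trace exactly one
        et.start()
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step == 5:
        et.stop()
        break

et.unregister_callback()
```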
I did try this for a single epoch though.
Hello, have you solved this problem now?
I did. I generated separate execution traces for each GPU. It seems there was some kind of race condition between the two GPUs writing to the same execution trace file.
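For anyone hitting the same thing, a sketch of the per-rank output scheme described above; the file-name pattern is illustrative:

```python
import os
import torch.distributed as dist
from torch.profiler import ExecutionTraceObserver

# Give each rank its own output file so concurrent processes
# never write to the same trace.
rank = dist.get_rank() if dist.is_initialized() else int(os.environ.get("RANK", "0"))
et = ExecutionTraceObserver()
et.register_callback(f"pytorch_et_rank{rank}.json")
```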