trace_linker: [WARNING]: No CUDA runtime operator found for correlation ID

Open 9LLPPLL6 opened this issue 6 months ago • 3 comments

Please provide a detailed description of your question or the information you seek.

I encountered the following warning while using chakra link:

[2025-06-04 04:00:01,109] trace_linker.py:679 [WARNING]: No CUDA runtime operator found for correlation ID 17297502. This is not a common case, and there should be a corresponding CUDA runtime operator for a given GPU kernel operator. It can be a case where CUDA runtime operators are not properly identified and added to the map, kineto_correlation_cuda_runtime_map. Please manually check if the corresponding CUDA runtime operator with the correlation is dropped by mistake. It is likely that it is because of incomplete map, cuda_launch_operations, in is_kernel_launch_op. Please update the map properly to cover all CUDA runtime launch operators.
[2025-06-04 04:00:01,109] trace_linker.py:625 [WARNING]: Missing parent CPU operator for GPU op 'void at::native::(anonymous namespace)::multi_tensor_apply_kernel<at::native::(anonymous namespace)::TensorListScalarListMetadata<float, 3>, at::native::(anonymous namespace)::PointwiseOpScalarListFunctor<float, 3, 3, 0>, std::divides<float> >(at::native::(anonymous namespace)::TensorListScalarListMetadata<float, 3>, at::native::(anonymous namespace)::PointwiseOpScalarListFunctor<float, 3, 3, 0>, std::divides<float>)'. Orphaned GPU operator.

I am running distributed training with dp=4 on 4x A6000 machines. The repository I am training with is here: repo. I'm not sure whether this warning is caused by something I did wrong. I did not run into this problem when collecting traces while training with Megatron.

I also raised this in the PyTorch community: issue. The trace for rank 0 does not show this problem, but some kernels on the other ranks do. I am not sure whether this is normal.

9LLPPLL6 avatar Jun 04 '25 05:06 9LLPPLL6

@TaekyungHeo @srinivas212 @rvinaybharadwaj @AlexandruAntonescuKeysight @JoongunPark @tushar-krishna @nathanw-mlc

9LLPPLL6 avatar Jun 04 '25 05:06 9LLPPLL6

Can you check your Kineto trace to see whether there is a cudaLaunchKernelExC event with the same correlation ID as your failing collective?

Image
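
One way to run this check over a whole trace rather than by eye is sketched below. It is only a sketch and rests on assumptions: that the Kineto trace is a Chrome-trace-style JSON file with a top-level `traceEvents` list, that GPU work is tagged with `cat` values such as `kernel`, `gpu_memcpy`, or `gpu_memset`, that launch calls are tagged `cuda_runtime` or `cuda_driver`, and that both sides carry `args.correlation`. Category names can differ between PyTorch versions, and the helper names `find_orphan_gpu_ops` and `load_trace_events` are made up for this example; they are not part of Chakra or Kineto.

```python
import json

# Assumed Kineto category names; adjust if your PyTorch version differs.
GPU_CATS = {"kernel", "gpu_memcpy", "gpu_memset"}
LAUNCH_CATS = {"cuda_runtime", "cuda_driver"}


def load_trace_events(path):
    """Load the event list from a Kineto JSON trace (assumed layout)."""
    with open(path) as f:
        return json.load(f)["traceEvents"]


def find_orphan_gpu_ops(events):
    """Return (correlation, name) for GPU events with no launch-side event.

    A GPU event counts as orphaned when no cuda_runtime/cuda_driver event
    in the same trace carries the same correlation ID.
    """
    # First pass: collect every correlation ID seen on the launch side.
    launch_corrs = set()
    for ev in events:
        corr = (ev.get("args") or {}).get("correlation")
        if corr is not None and ev.get("cat") in LAUNCH_CATS:
            launch_corrs.add(corr)

    # Second pass: report GPU events whose correlation ID was never launched.
    orphans = []
    for ev in events:
        corr = (ev.get("args") or {}).get("correlation")
        if ev.get("cat") in GPU_CATS and corr is not None:
            if corr not in launch_corrs:
                orphans.append((corr, ev.get("name")))
    return orphans
```

Calling `find_orphan_gpu_ops(load_trace_events(path))` on each rank's Kineto file should list the kernels that have no launch event in the trace itself. If a correlation ID from the chakra link warning shows up there, the launch event is already missing from the Kineto trace and was not dropped later by the linker.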

theodorbadea avatar Jun 13 '25 12:06 theodorbadea

Can you check your Kineto trace to see whether there is a cudaLaunchKernelExC event with the same correlation ID as your failing collective?

Image

Some of my compute kernels have no corresponding cudaLaunchKernel event, and they only appear for a short period of time.

9LLPPLL6 avatar Jun 14 '25 12:06 9LLPPLL6