
torch xla gpu training unable to profile

Open mars1248 opened this issue 1 year ago • 17 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. PJRT_DEVICE=CUDA python test_train_spmd_imagenet.py --fake_data --batch_size 16 --model=resnet50 --sharding=batch --profile
  2. In another process, call xp.trace('localhost:9012', '/tmp/tensorboard')
  3. The main process then prints the following error message:

2024-01-31 16:14:39.735250: W external/tsl/tsl/profiler/lib/profiler_session.cc:110] Profiling is late by 4060324 nanoseconds and will start immediately.
2024-01-31 16:14:41.905609: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:163] cuptiEnableCallback: error 1: CUPTI_ERROR_INVALID_PARAMETER
2024-01-31 16:14:41.905656: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:449] CuptiErrorManager is disabling profiling automatically.
2024-01-31 16:14:41.905663: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:449] CuptiErrorManager is disabling profiling automatically.
2024-01-31 16:14:41.905678: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:41.905684: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:1940] function cupti_interface_->EnableCallback( 1 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error
| Training Device=xla:0/0 Epoch=1 Step=580 Loss=0.19359 Rate=132.86 GlobalRate=117.30 Time=16:14:42
2024-01-31 16:14:43.064375: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:153] cuptiEnableCallback: ignored due to a previous error.
2024-01-31 16:14:43.064417: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:43.064425: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:1962] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error
2024-01-31 16:14:43.064434: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:142] cuptiFinalize: ignored due to a previous error.
2024-01-31 16:14:43.064439: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:43.064444: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:2026] function cupti_interface_->Finalize()failed with error
2024-01-31 16:14:43.351881: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:135] cuptiGetTimestamp: ignored due to a previous error.
2024-01-31 16:14:43.352045: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:135] cuptiGetTimestamp: ignored due to a previous error.
| Training Device=xla:0/0 Epoch=1 Step=600 Loss=0.16818 Rate=171.74 GlobalRate=118.91 Time=16:14:44

Environment

  • Reproducible on XLA backend [CPU/TPU/GPU]: GPU
  • torch_xla version: 2.2.0+git0db490d
  • torch version: 2.3.0a0+git4451ca0
  • CUDA version: 11.8

Additional context

mars1248 avatar Jan 31 '24 08:01 mars1248

@vanbasten23 have you ever run into this kind of issue?

JackCaoG avatar Jan 31 '24 18:01 JackCaoG

Taking a look

vanbasten23 avatar Feb 01 '24 23:02 vanbasten23

@vanbasten23 Hello, any update on this? Could it be that my CUDA version is too old? Does CUPTI in torch_xla require CUDA 12.0+?

mars1248 avatar Feb 05 '24 12:02 mars1248

cc @ysiraichi

miladm avatar Feb 05 '24 18:02 miladm

I'm using CUDA 12.1 and I didn't see the error.

I got the trace this way:

# in my container
root@xiowei-gpu:/ansible# PJRT_DEVICE=CUDA python pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 16 --model=resnet50 --sharding=batch --profile &

# while the training script is running, let's capture the trace
root@xiowei-gpu:/ansible# wget https://raw.githubusercontent.com/pytorch/xla/master/scripts/capture_profile.py
root@xiowei-gpu:/ansible# python3 capture_profile.py --service_addr "localhost:9012" --logdir ~/profiles/ --duration_ms 2000

After that, I found the trace in /root/profiles/plugins/profile/2024_02_06_01_10_46/localhost_9012.xplane.pb. Then you can load it into TensorBoard.

With that said, could you retry with our CUDA 12.1 docker image?

vanbasten23 avatar Feb 06 '24 01:02 vanbasten23

With the profile I captured, I can see a very nice memory viewer: image

It's probably not as helpful as the torch memory viewer in terms of debugging OOM issues, but just an FYI @ysiraichi. The XLA folks probably have more insight into the memory viewer.

vanbasten23 avatar Feb 06 '24 19:02 vanbasten23

@vanbasten23 Thanks. I will give it a go.

ysiraichi avatar Feb 06 '24 19:02 ysiraichi

You can find a tutorial here: https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#profiling. Note that you may need to use the profiler from torch_xla instead of the one from PyTorch: import torch_xla.debug.profiler as xp
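For a GPU run the flow is the same; a minimal sketch of the two pieces involved, assuming the profiler server listens on port 9012 (the --profile flag of the test script already starts it there) and the trace is requested from a second process:

# training process: start the profiler server once near the top of the script
import torch_xla.debug.profiler as xp

server = xp.start_server(9012)

# ... training loop runs here ...

# second process: ask the running server for a trace and write it to a
# TensorBoard log directory (the duration value below is arbitrary)
import torch_xla.debug.profiler as xp

xp.trace('localhost:9012', '/tmp/tensorboard', duration_ms=2000)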

vanbasten23 avatar Feb 06 '24 19:02 vanbasten23

Thank you very much. I upgraded to CUDA 12.1 and can now profile successfully. However, I see this warning in TensorBoard: "No step marker observed and hence the step time is unknown. This may happen if (1) training steps are not instrumented (e.g., if you are not using Keras) or (2) the profiling duration is shorter than the step time. For (1), you need to add step instrumentation; for (2), you may try to profile longer." Also, why is my TensorBoard different from yours, and why is my GPU FLOPS utilization 0.0%? This is my TensorBoard: image

mars1248 avatar Feb 22 '24 08:02 mars1248

Hey @mars1248, I did see the same warning with --duration_ms 2000. Following the warning's advice to "profile longer", I used a larger duration, for example 20000, and the warning went away. However, on the op_profile view I also see Overall GPU FLOPS utilization is 0.0%.
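For point (1) in the warning, the step instrumentation the profiler looks for is the xp.StepTrace annotation; a minimal sketch of what that looks like in a generic training loop (model, loss_fn, optimizer, and loader are placeholders, not code from this thread):

import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

for step, (data, target) in enumerate(loader):
    # StepTrace emits the step marker the warning refers to and also issues
    # an xm.mark_step() when the scope exits.
    with xp.StepTrace('train_step', step_num=step):
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        xm.optimizer_step(optimizer)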

vanbasten23 avatar Mar 01 '24 01:03 vanbasten23

I'm checking with the TensorBoard team on the Overall GPU FLOPS utilization is 0.0% issue. @mars1248

vanbasten23 avatar Mar 05 '24 18:03 vanbasten23

I wrote the following standalone test for GPU SPMD training, hoping to profile an AMP training scenario, but it raises this error: RuntimeError: Expecting scope to be empty but it is train_loop.1

import torch
import torch_xla
import torch_xla.runtime as xr
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharding import Mesh
from torch_xla.amp import autocast, GradScaler
import numpy as np
import torch.optim as optim
import torch_xla.debug.profiler as xp
# Enable XLA SPMD execution mode.
xr.use_spmd()

# Assuming 4 GPU devices; PJRT_DEVICE=GPU, NUM_GPU_DEVICES=4
num_devices = xr.global_runtime_device_count()
print(num_devices)

# mesh shape will be (2,2) in this example
mesh_shape = (num_devices // 2, 2)
device_ids = np.array(range(num_devices))
# axis_names 'x' and 'y' are optional
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

t1 = torch.randn(16, 36, device='cpu')

xt1 = t1.to(xm.xla_device())
xs.mark_sharding(xt1, mesh, (0, 1))

# check to see if the spmd computation used all devices
print(torch_xla._XLAC._get_xla_sharding_spec(xt1))

model = torch.nn.Linear(36, 128)
model.to(xm.xla_device())
loss_func = torch.nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(1, 128)
target = target.to(xm.xla_device())
scaler = GradScaler(use_zero_grad=False)
with xp.StepTrace('train_loop', step_num=0):
    with autocast(xm.xla_device()):
        output = model(xt1)
        loss = loss_func(output, target)
        print(f'Initial loss: {loss.item()}')
    scaler.scale(loss).backward()

    # output = model(xt1)
    # loss = loss_func(output, target)
    # print(f'Initial loss: {loss.item()}')
    # loss.backward()
    # xm.mark_step()


    gradients = xm._fetch_gradients(optimizer)
    xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
    xm.mark_step()
    scaler.step(optimizer)
    scaler.update()
# print(gradients)

mars1248 avatar Mar 06 '24 09:03 mars1248

I found that the error is caused by this line of code. What is the purpose of this mark_step? https://github.com/pytorch/xla/blob/master/torch_xla/amp/grad_scaler.py#L77

mars1248 avatar Mar 06 '24 09:03 mars1248

The purpose of the mark_step is to make XLA compile and execute its current graph and materialize the tensors.
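As an illustration of that lazy-execution model (a minimal standalone sketch, not the grad scaler code itself):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# These operations are only recorded into XLA's pending graph.
a = torch.randn(4, 4, device=device)
b = (a @ a).sum()

# mark_step() cuts the graph here: XLA compiles and executes it, so `b` is
# backed by a real device value afterwards.
xm.mark_step()
print(b.item())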

vanbasten23 avatar Mar 06 '24 21:03 vanbasten23

On second thought, can you comment out the line print(f'Initial loss: {loss.item()}') and try again? Also, which line in your script (https://github.com/pytorch/xla/issues/6422#issuecomment-1980416414) fails?

vanbasten23 avatar Mar 07 '24 00:03 vanbasten23

@vanbasten23 The line scaler.step(optimizer) raises this exception: RuntimeError: Expecting scope to be empty but it is train_loop.1

mars1248 avatar Mar 07 '24 08:03 mars1248

The error RuntimeError: Expecting scope to be empty but it is train_loop.1 indicates that the code calls mark_step within an xp.StepTrace scope. Note that xp.StepTrace adds an xm.mark_step() when exiting the scope, per https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#adding_traces.

You probably also want to remove the xm.mark_step() before scaler.step(optimizer), going by the example at https://github.com/pytorch/xla/blob/300bc8c1720aec38279eb6d72cbfa5e564a76ce3/test/test_train_mp_mnist_amp.py#L161.
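For concreteness, the step from your earlier snippet with the print(loss.item()) and the explicit xm.mark_step() removed would look roughly like this (a sketch only, reusing the names defined in that snippet and following the linked MNIST AMP test):

with xp.StepTrace('train_loop', step_num=0):
    with autocast(xm.xla_device()):
        output = model(xt1)
        loss = loss_func(output, target)
    scaler.scale(loss).backward()
    gradients = xm._fetch_gradients(optimizer)
    xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
    # No explicit xm.mark_step() inside the scope: StepTrace issues one when
    # the scope exits.
    scaler.step(optimizer)
    scaler.update()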

vanbasten23 avatar Mar 07 '24 21:03 vanbasten23

Thank you for your answer; I have solved my problem. Is there any way to see which op these CUDA kernels are called by, and preferably to see the corresponding Python stack? You can see it in torch's trace. @vanbasten23 image
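For reference, the torch trace I mean is the one produced by the native PyTorch profiler with Python stacks enabled; a minimal sketch using the standard torch.profiler API (train_step here is a placeholder for one iteration of the training loop):

from torch.profiler import profile, ProfilerActivity

# with_stack=True records the Python call stack for each op, so the exported
# trace can link a CUDA kernel back to the launching op and Python frame.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_stack=True) as prof:
    train_step()

prof.export_chrome_trace('trace.json')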

mars1248 avatar Mar 13 '24 09:03 mars1248

I'm checking with the TensorBoard team on the Overall GPU FLOPS utilization is 0.0% issue.

Just want to get back to you about it. The TensorBoard team told me this is a feature that has not been implemented yet.

Is there any way to see which op these cuda kernels are called by, preferably

How did you get the above profile? Also, in your screenshot, do you mean that ncclKernel_AllReduce.. calls ampere_fp16_s16816...?

vanbasten23 avatar Mar 15 '24 02:03 vanbasten23