torch_xla GPU training: unable to profile
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1. Run the training script with profiling enabled:
   PJRT_DEVICE=CUDA python test_train_spmd_imagenet.py --fake_data --batch_size 16 --model=resnet50 --sharding=batch --profile
2. From another process, request a trace:
   xp.trace('localhost:9012', '/tmp/tensorboard')
3. The main process prints this error message:
2024-01-31 16:14:39.735250: W external/tsl/tsl/profiler/lib/profiler_session.cc:110] Profiling is late by 4060324 nanoseconds and will start immediately.
2024-01-31 16:14:41.905609: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:163] cuptiEnableCallback: error 1: CUPTI_ERROR_INVALID_PARAMETER
2024-01-31 16:14:41.905656: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:449] CuptiErrorManager is disabling profiling automatically.
2024-01-31 16:14:41.905663: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:449] CuptiErrorManager is disabling profiling automatically.
2024-01-31 16:14:41.905678: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:41.905684: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:1940] function cupti_interface_->EnableCallback( 1 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error
| Training Device=xla:0/0 Epoch=1 Step=580 Loss=0.19359 Rate=132.86 GlobalRate=117.30 Time=16:14:42
2024-01-31 16:14:43.064375: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:153] cuptiEnableCallback: ignored due to a previous error.
2024-01-31 16:14:43.064417: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:43.064425: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:1962] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error
2024-01-31 16:14:43.064434: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:142] cuptiFinalize: ignored due to a previous error.
2024-01-31 16:14:43.064439: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:459] cuptiGetResultString: ignored due to a previous error.
2024-01-31 16:14:43.064444: E external/xla/xla/backends/profiler/gpu/cupti_tracer.cc:2026] function cupti_interface_->Finalize()failed with error
2024-01-31 16:14:43.351881: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:135] cuptiGetTimestamp: ignored due to a previous error.
2024-01-31 16:14:43.352045: E external/xla/xla/backends/profiler/gpu/cupti_error_manager.cc:135] cuptiGetTimestamp: ignored due to a previous error.
| Training Device=xla:0/0 Epoch=1 Step=600 Loss=0.16818 Rate=171.74 GlobalRate=118.91 Time=16:14:44
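For reference, a minimal sketch of the two processes involved in the repro (the port 9012 and paths are the ones from the steps above; the training script run with --profile keeps a profiler server alive, here shown explicitly via xp.start_server):

# A minimal sketch of the capture setup used above (two separate processes).
import torch_xla.debug.profiler as xp

# Process 1 (training): keep a reference so the profiler server stays alive
# for the whole run.
server = xp.start_server(9012)

# Process 2 (capture): connect to the running server and write the trace
# where TensorBoard can pick it up (same call as step 2 above).
xp.trace('localhost:9012', '/tmp/tensorboard')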
Environment
- Reproducible on XLA backend [CPU/TPU/GPU]: GPU
- torch_xla version: 2.2.0+git0db490d
- torch version: 2.3.0a0+git4451ca0
- CUDA version: 11.8
Additional context
@vanbasten23 have you ever run into this kind of issue?
Taking a look
@vanbasten23 Hello, any update on this? Could it be that my CUDA version is too old? Does CUPTI in torch_xla require CUDA 12.0+?
cc @ysiraichi
I'm using CUDA 12.1 and I didn't see the error.
I got the trace this way:
# in my container
root@xiowei-gpu:/ansible# PJRT_DEVICE=CUDA python pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 16 --model=resnet50 --sharding=batch --profile &
# while the training script is running, let's capture the trace
root@xiowei-gpu:/ansible# wget https://raw.githubusercontent.com/pytorch/xla/master/scripts/capture_profile.py
root@xiowei-gpu:/ansible# python3 capture_profile.py --service_addr "localhost:9012" --logdir ~/profiles/ --duration_ms 2000
After that, I found the trace in /root/profiles/plugins/profile/2024_02_06_01_10_46/localhost_9012.xplane.pb. Then you can load it in TensorBoard.
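If it helps, TensorBoard can also be pointed at that directory from Python (a sketch; it assumes the tensorboard and tensorboard-plugin-profile packages are installed and uses the /root/profiles path from above):

# Launch TensorBoard programmatically against the captured profile directory.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, '--logdir', '/root/profiles'])
url = tb.launch()  # the Profile plugin picks up plugins/profile/<run>/*.xplane.pb
print(f'TensorBoard running at {url}')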
With that said, could you try using our cuda 12.1 docker image and retry?
With the profile I captured, I can see a very nice memory viewer:
It's probably not as helpful as the torch memory viewer for debugging OOM issues, but just an FYI @ysiraichi. The XLA folks may have more insight into the memory viewer.
@vanbasten23 Thanks. I will give it a go.
You can find a tutorial here: https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#profiling. Note that you may need to use the profiler from torch_xla instead of the one from PyTorch: import torch_xla.debug.profiler as xp
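For example, the annotations from that tutorial look roughly like this (a sketch; train_loop_fn and its arguments are placeholders, and the port matches the default 9012 used above):

import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

# Keep a reference to the server so it stays alive for the whole run.
server = xp.start_server(9012)

def train_loop_fn(loader, model, loss_fn, optimizer):
  for step, (data, target) in enumerate(loader):
    # StepTrace emits a step marker (and an implicit mark_step on scope exit).
    with xp.StepTrace('train_step', step_num=step):
      with xp.Trace('forward'):
        output = model(data)
        loss = loss_fn(output, target)
      with xp.Trace('backward'):
        loss.backward()
      with xp.Trace('optimizer_step'):
        xm.optimizer_step(optimizer)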
Thank you very much. I upgraded to CUDA 12.1 and can now profile successfully. But I see this warning in TensorBoard:
No step marker observed and hence the step time is unknown. This may happen if (1) training steps are not instrumented (e.g., if you are not using Keras) or (2) the profiling duration is shorter than the step time. For (1), you need to add step instrumentation; for (2), you may try to profile longer.
Also, why is my TensorBoard different from yours, and why is my task's GPU FLOPS utilization 0.0%?
This is my TensorBoard:
Hey @mars1248, I did see the same warning as yours with --duration_ms 2000. So I followed the warning's advice ("you may try to profile longer") by using a larger duration, for example 20000 (see the sketch below), and then I don't see the warning anymore. However, on the op_profile view, I also see "Overall GPU FLOPS utilization is 0.0%".
I'm checking with the TensorBoard team on the "Overall GPU FLOPS utilization is 0.0%" issue.
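For reference, the longer capture can be requested either via capture_profile.py --duration_ms 20000 or directly from Python (a sketch; duration_ms mirrors the script's flag):

# Ask the running profiler server for a 20-second trace so that at least one
# full training step (with its step marker) lands inside the capture window.
import torch_xla.debug.profiler as xp

xp.trace('localhost:9012', '/tmp/tensorboard', duration_ms=20000)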
@mars1248
I wrote the following small test for GPU SPMD training, hoping to profile AMP training scenarios, but it fails with this error: RuntimeError: Expecting scope to be empty but it is train_loop.1
import torch
import torch_xla
import torch_xla.runtime as xr
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
from torch_xla.experimental.xla_sharding import Mesh
from torch_xla.amp import autocast, GradScaler
import numpy as np
import torch.optim as optim
import torch_xla.debug.profiler as xp

# Enable XLA SPMD execution mode.
xr.use_spmd()

# Assuming 4 GPU devices; PJRT_DEVICE=GPU, NUM_GPU_DEVICES=4
num_devices = xr.global_runtime_device_count()
print(num_devices)
# mesh shape will be (2, 2) in this example
mesh_shape = (num_devices // 2, 2)
device_ids = np.array(range(num_devices))
# axis_names 'x' and 'y' are optional
mesh = Mesh(device_ids, mesh_shape, ('x', 'y'))

t1 = torch.randn(16, 36, device='cpu')
xt1 = t1.to(xm.xla_device())
xs.mark_sharding(xt1, mesh, (0, 1))
# check to see if the spmd computation used all devices
print(torch_xla._XLAC._get_xla_sharding_spec(xt1))

model = torch.nn.Linear(36, 128)
model.to(xm.xla_device())
loss_func = torch.nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
target = torch.randn(1, 128)
target = target.to(xm.xla_device())
scaler = GradScaler(use_zero_grad=False)

with xp.StepTrace('train_loop', step_num=0):
  with autocast(xm.xla_device()):
    output = model(xt1)
    loss = loss_func(output, target)
    print(f'Initial loss: {loss.item()}')
  scaler.scale(loss).backward()
  # output = model(xt1)
  # loss = loss_func(output, target)
  # print(f'Initial loss: {loss.item()}')
  # loss.backward()
  # xm.mark_step()
  gradients = xm._fetch_gradients(optimizer)
  xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
  xm.mark_step()
  scaler.step(optimizer)
  scaler.update()
  # print(gradients)
I found that the error is caused by this line of code. What is the purpose of this mark_step? https://github.com/pytorch/xla/blob/master/torch_xla/amp/grad_scaler.py#L77
The purpose of the mark_step is to make XLA compile and execute its current graph and materialize the tensors.
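A tiny illustration of that lazy-execution behavior (a sketch with arbitrary values):

# Ops on XLA tensors only record a graph; mark_step() compiles and runs it.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
b = a @ a + 1    # traced into the pending XLA graph, not executed yet
xm.mark_step()   # compile and execute the graph, materializing `b` on device
print(b.cpu())   # reading the value is now cheap (it would also force execution)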
On another note, can you comment out the line print(f'Initial loss: {loss.item()}') and try again?
Also, which line in your script https://github.com/pytorch/xla/issues/6422#issuecomment-1980416414 fails?
@vanbasten23
The line scaler.step(optimizer) raises this exception: RuntimeError: Expecting scope to be empty but it is train_loop.1
The error RuntimeError: Expecting scope to be empty but it is train_loop.1 indicates that the code calls mark_step within an xp.StepTrace scope. Note that xp.StepTrace adds an xm.mark_step() when exiting the scope, per https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#adding_traces.
So you probably want to remove the xm.mark_step() before scaler.step(optimizer). I'm looking at the example at https://github.com/pytorch/xla/blob/300bc8c1720aec38279eb6d72cbfa5e564a76ce3/test/test_train_mp_mnist_amp.py#L161 for reference.
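Putting that together, one way the step from the snippet above could be restructured (a sketch that reuses the objects defined there and mirrors that mnist AMP test; no explicit mark_step inside the StepTrace):

# Uses the objects defined in the script above (model, xt1, target, loss_func,
# optimizer, scaler); StepTrace itself issues xm.mark_step() when the scope exits.
with xp.StepTrace('train_loop', step_num=0):
  with autocast(xm.xla_device()):
    output = model(xt1)
    loss = loss_func(output, target)
  scaler.scale(loss).backward()
  gradients = xm._fetch_gradients(optimizer)
  xm.all_reduce('sum', gradients, scale=1.0 / xm.xrt_world_size())
  scaler.step(optimizer)
  scaler.update()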
Thank you for your answer. I have solved my problem.
Is there any way to see which op these CUDA kernels are called from, and ideally the corresponding Python stack? You can see this in torch's trace.
@vanbasten23
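For comparison, the PyTorch-native trace mentioned above is usually captured with torch.profiler, which can record Python stacks for CUDA kernels (a sketch on a plain CUDA model, not the XLA path):

# Capture a torch.profiler trace with Python stacks; view it in TensorBoard or chrome://tracing.
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(36, 128).cuda()
x = torch.randn(16, 36, device='cuda')

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             with_stack=True) as prof:
  model(x).sum().backward()

prof.export_chrome_trace('/tmp/torch_trace.json')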
I'm checking with tensorboard team on the issue Overall GPU FLOPS utilization is 0.0%
Just want to get back to you about this: the TensorBoard team told me this is a feature that has not been implemented yet.
Is there any way to see which op these cuda kernels are called by, preferably
How did you get the above profile? Also, in your screenshot, do you mean that ncclKernel_AllReduce.. calls ampere_fp16_s16816...?