mars1248

Results: 10 comments by mars1248

@vanbasten23 Hello, what's going on here? Could it be that my CUDA version is too old? Does CUPTI for torch_xla require CUDA 12.0+?

> I'm using cuda 12.1 and I didn't see the error.
>
> I got the trace this way:
>
> ```
> # in my container
> root@xiowei-gpu:/ansible# PJRT_DEVICE=CUDA...
> ```
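For context, the quoted workflow uses the PyTorch/XLA profiler. A minimal sketch of that capture flow (the port, log directory, and duration here are placeholder choices, not values from the quoted session):

```python
import torch_xla.debug.profiler as xp

# Start the profiler server inside the training process; the port is arbitrary.
server = xp.start_server(9012)

# ... run some warm-up training steps ...

# Capture a trace of the running program; the output directory can then be
# opened with TensorBoard's profile plugin.
xp.trace('localhost:9012', logdir='/tmp/xla_profile', duration_ms=5000)
```

Note that `xp.trace` is normally invoked from a separate process or thread while the training loop is running.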

I wrote a small single test for GPU SPMD training, hoping to profile AMP training scenarios, but I hit this error: `RuntimeError: Expecting scope to be empty but it...

I found that it was caused by this line of code. What is the purpose of this `mark_step`? https://github.com/pytorch/xla/blob/master/torch_xla/amp/grad_scaler.py#L77
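For readers unfamiliar with it: `mark_step` is the point where PyTorch/XLA cuts the lazily recorded graph and actually compiles and executes the pending operations. A minimal sketch of that behavior (shapes are illustrative):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.ones(2, 2, device=device)
y = x * 3          # only recorded in the lazy graph, nothing runs yet
xm.mark_step()     # the graph is cut here, compiled, and executed on device
```

In the grad scaler, the `mark_step` presumably forces the pending inf/nan checks to materialize before deciding whether to skip the optimizer step; that reading is an inference from the linked code, not confirmed in this thread.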

@vanbasten23 The line `scaler.step(optimizer)` raises this exception: `RuntimeError: Expecting scope to be empty but it is train_loop.1`

Thank you for your answer; I have solved my problem. Is there any way to see which op these CUDA kernels are called by? Preferably, it would let me see...
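One way to attribute kernels to ops is the profiler's named traces, sketched below (the model and shapes are made up for illustration):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)
x = torch.randn(4, 8, device=device)

# Kernels launched inside a named trace show up under that label in the
# captured profile, so CUDA kernels can be mapped back to the enclosing region.
with xp.Trace('forward'):
    out = model(x)
with xp.Trace('backward'):
    out.sum().backward()
xm.mark_step()
```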

@vanbasten23 Thank you for your answer. I have multi-machine distributed training running. I would like to ask: how do I apply AMP (automatic mixed precision) to SPMD training mode?
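For reference, the AMP-on-XLA:GPU pattern from the amp.md doc referenced later in this thread looks roughly like the sketch below; whether it composes cleanly with SPMD sharding is exactly the open question here. The model, optimizer, and shapes are placeholders:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, GradScaler

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
x = torch.randn(4, 8, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with autocast(device):          # run the forward pass in mixed precision
        loss = model(x).sum()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, runs the inf/nan checks
    scaler.update()
    xm.mark_step()
```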

@JackCaoG I followed the documentation below, but ran into this problem: `AssertionError: No inf checks were recorded for this optimizer.` Could you suggest some ideas for debugging it?

> The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu

> @JackCaoG I followed the documentation below, but ran into this problem: `AssertionError: No inf checks were recorded for this optimizer.` Could you suggest some ideas for debugging it?
>
> > The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu

@JackCaoG I tracked it down: the problem is that during training, the grads of the model parameters are all None. ![image](https://github.com/pytorch/xla/assets/62137145/d5a2cee8-f593-43f5-9024-70dfed4939aa)
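A quick diagnostic for the all-None grads, as a hedged sketch (`report_missing_grads` is a hypothetical helper, not part of torch_xla):

```python
def report_missing_grads(model):
    """Print parameters whose .grad is still None after backward().

    If every grad is None, backward() likely never ran on the (scaled) loss,
    which would also explain GradScaler recording no inf checks.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            print('no grad for', name)
```

Calling it right before `scaler.step(optimizer)` shows which parameters never received gradients.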

You can refer to this commit to use MultiHeadAttention as in TensorFlow: https://github.com/intelligent-machine-learning/dlrover/pull/850
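For comparison, `torch.nn.MultiheadAttention` with `batch_first=True` already mirrors the `(batch, seq, feature)` layout of `tf.keras.layers.MultiHeadAttention`; a minimal sketch (dimensions are illustrative and unrelated to the linked PR):

```python
import torch

# batch_first=True gives (batch, seq, feature) tensors, the same layout
# tf.keras.layers.MultiHeadAttention uses.
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
query = torch.randn(2, 10, 64)        # (batch, target_seq, embed_dim)
key = value = torch.randn(2, 16, 64)  # (batch, source_seq, embed_dim)
out, attn_weights = mha(query, key, value)
print(out.shape)  # torch.Size([2, 10, 64])
```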