mars1248

Results: 10 comments by mars1248

@vanbasten23 Hello, what's going on here? Could it be that my CUDA version is too old? Does CUPTI for torch_xla require CUDA 12.0+?

> I'm using cuda 12.1 and I didn't see the error.
>
> I got the trace this way:
>
> ```
> # in my container
> root@xiowei-gpu:/ansible# PJRT_DEVICE=CUDA...
> ```
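For context, the quoted workflow uses the PyTorch/XLA profiler. A minimal sketch of that capture flow (the port, log directory, and duration here are placeholder choices, not values from the quoted session):

```python
import torch_xla.debug.profiler as xp

# Start the profiler server inside the training process; the port is arbitrary.
server = xp.start_server(9012)

# ... run some warm-up training steps ...

# Capture a trace of the running program; the output directory can then be
# opened with TensorBoard's profile plugin.
xp.trace('localhost:9012', logdir='/tmp/xla_profile', duration_ms=5000)
```

Note that `xp.trace` is normally invoked from a separate process or thread while the training loop is running.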

I wrote a small single test for GPU SPMD training, hoping to profile AMP training scenarios, but I hit this error: `RuntimeError: Expecting scope to be empty but it...

I found that it was caused by this line of code. What is the purpose of this `mark_step`? https://github.com/pytorch/xla/blob/master/torch_xla/amp/grad_scaler.py#L77
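For readers unfamiliar with it: `mark_step` is the point where PyTorch/XLA cuts the lazily recorded graph and actually compiles and executes the pending operations. A minimal sketch of that behavior (shapes are illustrative):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.ones(2, 2, device=device)
y = x * 3          # only recorded in the lazy graph, nothing runs yet
xm.mark_step()     # the graph is cut here, compiled, and executed on device
```

In the grad scaler, the `mark_step` presumably forces the pending inf/nan checks to materialize before deciding whether to skip the optimizer step; that reading is an inference from the linked code, not confirmed in this thread.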

@vanbasten23 The line `scaler.step(optimizer)` raises this exception: `RuntimeError: Expecting scope to be empty but it is train_loop.1`

Thank you for your answer; I have solved my problem. Is there any way to see which op these CUDA kernels are called by? Preferably, it would let me see...
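One way to attribute kernels to ops is the profiler's named traces, sketched below (the model and shapes are made up for illustration):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)
x = torch.randn(4, 8, device=device)

# Kernels launched inside a named trace show up under that label in the
# captured profile, so CUDA kernels can be mapped back to the enclosing region.
with xp.Trace('forward'):
    out = model(x)
with xp.Trace('backward'):
    out.sum().backward()
xm.mark_step()
```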

@vanbasten23 Thank you for your answer. I have multi-machine distributed training running. I would like to ask: how do I apply AMP (automatic mixed precision) to SPMD training mode?
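For reference, the AMP-on-XLA:GPU pattern from the amp.md doc referenced later in this thread looks roughly like the sketch below; whether it composes cleanly with SPMD sharding is exactly the open question here. The model, optimizer, and shapes are placeholders:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.amp import autocast, GradScaler

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
x = torch.randn(4, 8, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with autocast(device):          # run the forward pass in mixed precision
        loss = model(x).sum()
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, runs the inf/nan checks
    scaler.update()
    xm.mark_step()
```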

@JackCaoG I followed the documentation below, but ran into this problem: `AssertionError: No inf checks were recorded for this optimizer.` Could you suggest some ideas for debugging it?

> The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu

> @JackCaoG I followed the documentation below, but ran into this problem: `AssertionError: No inf checks were recorded for this optimizer.` Could you suggest some ideas for debugging it?
>
> > The AMP logic should be the same with or without SPMD; you can take a look at https://github.com/pytorch/xla/blob/master/docs/amp.md#amp-for-xlagpu

@JackCaoG I tracked it down: the problem is that during training, the grads of the model parameters are all None. ![image](https://github.com/pytorch/xla/assets/62137145/d5a2cee8-f593-43f5-9024-70dfed4939aa)
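A quick diagnostic for the all-None grads, as a hedged sketch (`report_missing_grads` is a hypothetical helper, not part of torch_xla):

```python
def report_missing_grads(model):
    """Print parameters whose .grad is still None after backward().

    If every grad is None, backward() likely never ran on the (scaled) loss,
    which would also explain GradScaler recording no inf checks.
    """
    for name, p in model.named_parameters():
        if p.grad is None:
            print('no grad for', name)
```

Calling it right before `scaler.step(optimizer)` shows which parameters never received gradients.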

You can refer to this commit to use MultiHeadAttention as in TensorFlow: https://github.com/intelligent-machine-learning/dlrover/pull/850
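For comparison, `torch.nn.MultiheadAttention` with `batch_first=True` already mirrors the `(batch, seq, feature)` layout of `tf.keras.layers.MultiHeadAttention`; a minimal sketch (dimensions are illustrative and unrelated to the linked PR):

```python
import torch

# batch_first=True gives (batch, seq, feature) tensors, the same layout
# tf.keras.layers.MultiHeadAttention uses.
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
query = torch.randn(2, 10, 64)        # (batch, target_seq, embed_dim)
key = value = torch.randn(2, 16, 64)  # (batch, source_seq, embed_dim)
out, attn_weights = mha(query, key, value)
print(out.shape)  # torch.Size([2, 10, 64])
```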