
AMD LayerNorm Seg Fault in PyTorch

Open xw285cornell opened this issue 1 year ago • 2 comments

❓ The question

Hi, I'm from the PyTorch team. I recently became aware that OLMo needed a custom layer norm because LayerNorm would seg fault without a bias: https://github.com/allenai/OLMo/blob/cf121084409d844e4f540b7d08b8f37bbe1eec98/olmo/model.py#L203. I wonder if this has already been resolved? I tried this repro on current PyTorch and it seems to run just fine:

import torch

# Only meaningful on a ROCm build of PyTorch (torch.version.hip is set).
assert torch.version.hip is not None
input = torch.randn(10, 10, 10).cuda()
# LayerNorm without a bias parameter: this configuration used to seg fault on ROCm.
ln = torch.nn.LayerNorm([10, 10], bias=False).cuda()
ln(input).sum().backward()
print(ln.weight.grad)
assert ln.bias is None

xw285cornell commented Feb 06 '24

@dirkgr may know more about this.

To be clear, you attempted to repro this on current PyTorch on AMD GPUs?
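(For anyone checking this themselves, a quick sketch, my addition rather than something from the thread, of how to confirm the build and device before trusting the repro above:)

import torch

# torch.version.hip is the ROCm version string on a ROCm build, None on a CUDA build.
print(torch.version.hip)
# HIP devices surface through the CUDA API on ROCm builds.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))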

2015aroras commented Feb 08 '24

As far as I know, this was a ROCm-only problem, and AMD has already fixed it. If it works with Torch + ROCm 5.7, I would consider this resolved. In fact, we could get rid of that extra class.
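(For context, a minimal sketch of what such a bias-free layer-norm wrapper might look like; the name NoBiasLayerNorm is hypothetical, and the actual class lives in olmo/model.py and may differ:)

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoBiasLayerNorm(nn.Module):
    """Hypothetical sketch: a layer norm that never registers a bias
    parameter, sidestepping the old ROCm seg fault. Not the actual
    OLMo implementation."""

    def __init__(self, normalized_shape, eps: float = 1e-5):
        super().__init__()
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = tuple(normalized_shape)
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(self.normalized_shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pass bias=None explicitly; post-fix, the fused kernel
        # handles this on both CUDA and ROCm.
        return F.layer_norm(x, self.normalized_shape, self.weight, None, self.eps)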

We don't run models with this LN configuration anymore anyway, so it doesn't matter much for us, but it was quite a difficult bug to track down at the time.

dirkgr commented Feb 14 '24