
AMD LayerNorm Seg Fault in PyTorch

Open xw285cornell opened this issue 1 year ago • 2 comments

❓ The question

Hi, I'm from the PyTorch team. I recently became aware that OLMo needed a custom layer norm because LayerNorm would seg fault without a bias: https://github.com/allenai/OLMo/blob/cf121084409d844e4f540b7d08b8f37bbe1eec98/olmo/model.py#L203. I wonder if this has already been resolved? I tried this repro on current PyTorch and it seems to run just fine:

import torch

# Only meaningful on a ROCm build of PyTorch (torch.version.hip is set).
assert torch.version.hip is not None
input = torch.randn(10, 10, 10).cuda()
# LayerNorm without a bias parameter: this configuration used to seg fault on ROCm.
ln = torch.nn.LayerNorm([10, 10], bias=False).cuda()
ln(input).sum().backward()
print(ln.weight.grad)
assert ln.bias is None

xw285cornell commented Feb 06 '24

@dirkgr may know more about this.

To be clear, you attempted to repro this on current PyTorch on AMD GPUs?
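(For anyone checking this themselves, a quick sketch, my addition rather than something from the thread, of how to confirm the build and device before trusting the repro above:)

import torch

# torch.version.hip is the ROCm version string on a ROCm build, None on a CUDA build.
print(torch.version.hip)
# HIP devices surface through the CUDA API on ROCm builds.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))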

2015aroras commented Feb 08 '24

As far as I know, this was a ROCm-only problem, and AMD has already fixed it. If it works with Torch + ROCm 5.7, I would consider this resolved. In fact, we could get rid of that extra class.
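(For context, a minimal sketch of what such a bias-free layer-norm wrapper might look like; the name NoBiasLayerNorm is hypothetical, and the actual class lives in olmo/model.py and may differ:)

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoBiasLayerNorm(nn.Module):
    """Hypothetical sketch: a layer norm that never registers a bias
    parameter, sidestepping the old ROCm seg fault. Not the actual
    OLMo implementation."""

    def __init__(self, normalized_shape, eps: float = 1e-5):
        super().__init__()
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = tuple(normalized_shape)
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(self.normalized_shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pass bias=None explicitly; post-fix, the fused kernel
        # handles this on both CUDA and ROCm.
        return F.layer_norm(x, self.normalized_shape, self.weight, None, self.eps)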

We don't run models with this LN configuration anymore anyway, so it doesn't matter much for us, but it was quite a difficult bug to track down at the time.

dirkgr commented Feb 14 '24