Natalia Gimelshein

Search results: 11 issues by Natalia Gimelshein

See https://github.com/pytorch/pytorch/issues/71465. Slightly changes the LayerNorm2d implementation: 1) currently, when ln2d is called on a contiguous tensor, it accidentally switches most of the network into channels-last mode; line 114 undoes...
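The accidental memory-format switch can be illustrated with a common permute-based LayerNorm2d pattern. This is a hypothetical sketch, not the exact implementation the issue refers to: normalizing in NHWC and permuting back yields a tensor whose strides match channels-last, so everything downstream runs in that format.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """Hypothetical permute-based LayerNorm2d, similar to common implementations."""
    def forward(self, x):
        # NCHW -> NHWC, normalize over the channel dim, then back to NCHW.
        x = x.permute(0, 2, 3, 1)
        x = super().forward(x)
        # The result of the second permute is no longer NCHW-contiguous;
        # its strides correspond to channels-last memory format.
        return x.permute(0, 3, 1, 2)

ln = LayerNorm2d(8)
out = ln(torch.randn(2, 8, 4, 4))  # contiguous NCHW input
# out.is_contiguous()                                       -> False
# out.is_contiguous(memory_format=torch.channels_last)      -> True
```

Because layer_norm materializes a contiguous NHWC intermediate, the final permute silently propagates channels-last strides to the rest of the network.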

The original PixelCNN paper (https://arxiv.org/pdf/1606.05328.pdf) uses a gated unit defined as tanh(a) * sigmoid(b). The same formulation of the gated unit is used in the WaveNet paper. Yet here you switched to a gated unit...
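The tanh(a) * sigmoid(b) formulation from those papers can be sketched in a few lines (the channel-chunking convention here is an assumption; papers differ in how a and b are produced):

```python
import torch

def gated_activation(x):
    # Split features into two halves a, b and combine them as in the
    # PixelCNN / WaveNet gated unit: tanh(a) * sigmoid(b).
    a, b = x.chunk(2, dim=1)
    return torch.tanh(a) * torch.sigmoid(b)
```

The tanh branch provides a bounded signal while the sigmoid branch acts as a learned per-feature gate on it.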

We are getting issues like https://github.com/pytorch/pytorch/issues/90170, and we'll be getting more of them, where people try to use new hardware (a 4090 in this case) with an old toolkit (and hence old...

enhancement
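The new-hardware/old-toolkit mismatch above boils down to a capability check. A hypothetical helper sketch (arch_supported and its compatibility rules are illustrative assumptions, with entries loosely mimicking the strings torch.cuda.get_arch_list() returns):

```python
def arch_supported(device_capability, compiled_arches):
    """Hypothetical check: can a binary built for `compiled_arches`
    (entries like 'sm_86' or 'compute_70') run on a device with the
    given (major, minor) compute capability?
    """
    major, minor = device_capability
    for arch in compiled_arches:
        kind, num = arch.split("_")
        a_major, a_minor = divmod(int(num), 10)
        # SASS (sm_XY) is binary-compatible within the same major
        # version for devices with an equal or newer minor version.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX (compute_XY) can be JIT-compiled forward to newer devices.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False

# A 4090 (compute capability 8.9) with an old toolkit build that only
# shipped kernels up to sm_75 and no forward-compatible PTX:
arch_supported((8, 9), ["sm_60", "sm_70", "sm_75"])  # False
```

An old-toolkit build with neither an sm_8x cubin nor embedded PTX simply has nothing the driver can run or JIT for the new device, which is what the linked issue reports.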

Repro: ``` from ctypes import c_void_p, c_long import torch import random from torch import empty_strided, as_strided, device from torch._inductor.codecache import AsyncCompile aten = torch.ops.aten assert_size_stride = torch._C._dynamo.guards.assert_size_stride async_compile = AsyncCompile()...

bug

Repro: ``` from ctypes import c_void_p, c_long import torch import random from torch import empty_strided, as_strided, device from torch._inductor.codecache import AsyncCompile aten = torch.ops.aten assert_size_stride = torch._C._dynamo.guards.assert_size_stride async_compile = AsyncCompile()...

bug

Repro below. The same kernel is called with AOT and JIT compilation, and AOT produces a wrong result. The mask, instead of having a single True element (it's computed as `...

bug

Repro below. The generated PTX looks valid in both cases, with the only difference being movs with `@!pxx`, as expected. Happens with fp16; float32 is OK. I'm deliberately setting `other` to...

bug

This might be related to #714. Repro below (comments inside; requires torchdynamo, unfortunately). tl;dr: if the kernel has `xnumel=` where `xnumel` is also a kernel arg and is equal to the...

This matters only for operands of different signs that are not exactly divisible. Repro: ``` from ctypes import c_void_p, c_long import torch import random from torch import empty_strided, as_strided...

bug
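The discrepancy described above is the classic truncating-vs-flooring division mismatch. A minimal illustration (trunc_div is a hypothetical stand-in for what a C-style generated kernel computes):

```python
def trunc_div(a, b):
    # C-style integer division truncates toward zero, which is the
    # default behavior of `/` in generated C/CUDA code.
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

# Python's // floors toward negative infinity, so the two disagree
# exactly when the operands have different signs and don't divide evenly:
assert trunc_div(-7, 2) == -3 and (-7) // 2 == -4  # different signs, inexact
assert trunc_div(7, 2) == 3 and 7 // 2 == 3        # same sign: identical
assert trunc_div(-8, 2) == -4 and (-8) // 2 == -4  # exactly divisible: identical
```

Whenever both operands share a sign, or the division is exact, truncation and flooring coincide, which is why the bug only surfaces for mixed-sign, inexact cases.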

Fixes #91758. I'm not in love with casting `cudaError` to and from `int`, but I couldn't avoid it without major refactors, and we need to fix this bug soon.