Natalia Gimelshein
`nvrtc` doesn't have explicit APIs to handle ptx -> sass compilation; it can compile CUDA C code to sass or ptx depending on the options passed to `nvrtcCompileProgram`. Here we'd...
`ptxas`, `nvrtc` and friends don't have a query for this unfortunately, so we just have a bunch of conditionals manually listing max version from the docs.
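To illustrate what "manually listing max version from the docs" can look like, here is a minimal sketch written as a lookup table rather than a chain of conditionals. The function name `max_ptx_version` and the encoding (78 meaning PTX 7.8) are hypothetical, and the table entries are my reading of the PTX ISA release notes, so they should be checked against the docs before being relied on.

```python
# Hypothetical helper: maps a CUDA toolkit version to the highest PTX ISA
# version it supports, encoded as major * 10 + minor (78 means PTX 7.8).
# The entries are assumptions taken from the PTX ISA release notes.
_MAX_PTX_BY_CUDA = {
    (11, 0): 70, (11, 1): 71, (11, 2): 72, (11, 3): 73, (11, 4): 74,
    (11, 5): 75, (11, 6): 76, (11, 7): 77, (11, 8): 78,
    (12, 0): 80, (12, 1): 81, (12, 2): 82, (12, 3): 83, (12, 4): 84,
}


def max_ptx_version(cuda_major: int, cuda_minor: int) -> int:
    """Return the max PTX version for a toolkit, or raise if it is not listed."""
    try:
        return _MAX_PTX_BY_CUDA[(cuda_major, cuda_minor)]
    except KeyError:
        # There is no query API to fall back on, so unknown toolkits fail loudly.
        raise ValueError(
            f"no PTX version listed for CUDA {cuda_major}.{cuda_minor}"
        ) from None
```

Failing on an unlisted toolkit is a deliberate choice in this sketch: guessing a version for a toolkit that is not in the table could emit PTX that `ptxas` rejects.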
I replaced the `12544` literal with a kernel arg to make the generated code simpler, and got the following ptx:
```
ld.param.u32 %r27, [kernel1_0d1d2d3d4d_param_4];
shl.b32 %r28, %r26, 2;
mov.u32 %r29, %ctaid.x;
shl.b32 %r30,...
```
`int64_t` indexing in CUDA kernels is slower, but not dramatically so; I'd say a ~15% penalty. Also, there are no multiplications here, only divisions, and that works in C++:
```
uint32_t...
```
Even when I annotate the arg as i64, I'm still getting IMA:
```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache...
```
Ugh sorry, yeah, you can just remove the `import TileHint` line.
Yes, the problem still exists on master. #745 is different, I hit #745 because I'm forced to use `other` even though I don't think I need to (other values should...
Yeah it's possible that `undefined` values are loaded, but they are masked out in the following lines so it shouldn't matter whether there's an `other` specified or not, that's what...
`xmask` is always True in this case (I launch just 1 program_id, so maximum `xindex` is 1023, `xnumel` is much larger). So the same mask, `tmp33` is used for read...
Yeah I noticed that removing `xmask` changes results (I don't remember if it always fixed the problem), but that was very surprising too, as `xmask` is always true here.
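The claim that `xmask` is always true for a single program can be checked with plain index arithmetic. This is a sketch in ordinary Python, not the generated kernel: the helper `xmask_for` is hypothetical, the block size of 1024 is inferred from the maximum `xindex` of 1023, and the `xnumel` value is an arbitrary stand-in for "much larger".

```python
def xmask_for(program_id: int, xblock: int, xnumel: int) -> list[bool]:
    """Mimic the usual xoffset / xindex / xmask computation for one program."""
    xoffset = program_id * xblock
    xindex = [xoffset + i for i in range(xblock)]
    return [idx < xnumel for idx in xindex]


XBLOCK = 1024        # assumed block size, so the largest xindex is 1023
XNUMEL = 1_000_000   # arbitrary stand-in for an xnumel much larger than XBLOCK

# Only program_id 0 is launched, so every lane is in bounds.
mask = xmask_for(0, XBLOCK, XNUMEL)
always_true = all(mask)
```

With these numbers every element of `mask` is `True`, which is why a change in results from dropping `xmask` is surprising: the mask should be a no-op for this launch.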