Jason Ansel
@pytorchbot merge
Ah I think you are correct. The autotuner runs the kernel multiple times, which is not safe to do for in-place kernels because the first run clobbers the input data....
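A minimal sketch of the failure mode, using a toy in-place op and a hypothetical `naive_autotune` benchmarking loop (illustrative only, not the real autotuner):

```python
import torch

def naive_autotune(kernel, x, n_trials=3):
    # A naive autotuner benchmarks by invoking the kernel several times
    # and picking the fastest config. Each extra call re-applies the
    # in-place update, corrupting the data the "real" run would see.
    for _ in range(n_trials):
        kernel(x)

x = torch.ones(4)
naive_autotune(lambda t: t.add_(1), x)
# x is now 4.0, not the 2.0 a single in-place add would produce:
# the benchmarking reruns clobbered the input.
print(x)
```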
`tl.log()` still throws the same `LLVM ERROR: Broken function found, compilation aborted!` given a float64 tensor. This is annoying, but has the easy workaround of changing it to `tl.libdevice.log`. With...
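A minimal sketch of the workaround, assuming a Triton build that exposes `tl.libdevice` (newer releases have since reorganized these functions); the kernel name and launch parameters are illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def log_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # tl.log(x) hits the LLVM error on float64 inputs; the libdevice
    # variant compiles cleanly.
    y = tl.libdevice.log(x)
    tl.store(out_ptr + offs, y, mask=mask)

x = torch.rand(4096, dtype=torch.float64, device="cuda")
out = torch.empty_like(x)
log_kernel[(triton.cdiv(x.numel(), 1024),)](x, out, x.numel(), BLOCK=1024)
```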
I am fine with using fast hardware approximations by default. So I'd even suggest defaulting to `use_fast_math=True` (perhaps you meant to write that in your example?). A lot of the time...
@ptillet this may be related to #574 as they are both issues with 1xN or Nx1 blocks. This issue is forcing us to generate worse/slower code (turning Nx1 blocks into...
> Yes, this was actually a very minor issue. I have it fixed, but will merge along with the atomic_add and the rand constexpr fix tonight probably :)

Awesome! Thanks...
I believe so, but I'll let @pyjhzwh confirm. This one is pretty awkward to work around, because there are a few different ways to write a broadcasting load: 1) reshape the index...
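For reference, a minimal sketch of one common way to write a broadcasting load in Triton, where a 1D (per-row) load is reshaped so it broadcasts across an MxN tile; the kernel and argument names here are hypothetical:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_col_bias(x_ptr, bias_ptr, out_ptr, stride, M: tl.constexpr, N: tl.constexpr):
    rows = tl.arange(0, M)
    cols = tl.arange(0, N)
    offs = rows[:, None] * stride + cols[None, :]
    x = tl.load(x_ptr + offs)        # MxN tile
    bias = tl.load(bias_ptr + rows)  # per-row (Mx1) data, loaded as a 1D block
    # The [:, None] reshape lets the 1D load broadcast across the tile's columns.
    tl.store(out_ptr + offs, x + bias[:, None])

M, N = 64, 64
x = torch.randn(M, N, device="cuda")
bias = torch.randn(M, device="cuda")
out = torch.empty_like(x)
add_col_bias[(1,)](x, bias, out, x.stride(0), M=M, N=N)
```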
I'm hitting the same thing on CUDA 11.6 with latest master, so it isn't CUDA 11.4 specific. A clean build doesn't seem to help.
@pytorchbot merge