Jason Ansel

Results: 199 comments by Jason Ansel

Ah I think you are correct. The autotuner runs the kernel multiple times, which is not safe to do for in-place kernels because the first run clobbers the input data....
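To illustrate the failure mode described in this comment, here is a minimal pure-Python sketch. It does not use Triton's actual autotuner; `double_in_place`, `naive_autotune`, `restoring_autotune`, and `CONFIGS` are hypothetical names invented for this illustration. The assumption is only that a tuner runs the same kernel once per candidate configuration on the same buffer.

```python
import copy

# Hypothetical candidate configurations; only the chunk size varies,
# so every config should produce the same result.
CONFIGS = [{"block": 1}, {"block": 2}, {"block": 4}]


def double_in_place(buf, block):
    """Toy stand-in for an in-place kernel: doubles every element of buf,
    walking the buffer in chunks of `block` elements."""
    for start in range(0, len(buf), block):
        for i in range(start, min(start + block, len(buf))):
            buf[i] *= 2


def naive_autotune(kernel, buf, configs):
    """Run the kernel once per config on the SAME buffer.
    Each run sees the output of the previous run, not the original input."""
    for cfg in configs:
        kernel(buf, **cfg)
    return buf


def restoring_autotune(kernel, buf, configs):
    """Snapshot the input and restore it before every trial run,
    so the final contents match a single application of the kernel."""
    snapshot = copy.copy(buf)
    for cfg in configs:
        buf[:] = snapshot
        kernel(buf, **cfg)
    return buf


print(naive_autotune(double_in_place, [1, 2, 3], CONFIGS))      # [8, 16, 24]
print(restoring_autotune(double_in_place, [1, 2, 3], CONFIGS))  # [2, 4, 6]
```

With three configs the naive loop applies the doubling three times, so the data ends up scaled by 8 instead of 2. Restoring a snapshot before each trial is one possible mitigation; writing to a separate output buffer is another.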

`tl.log()` still throws the same `LLVM ERROR: Broken function found, compilation aborted!` given a float64 tensor. This is annoying, but has the easy workaround of changing it to `tl.libdevice.log`. With...

I am fine with using fast hardware approximations by default. So I'd even suggest defaulting `use_fast_math=True` (perhaps you meant to write that in your example?). A lot of the time...

@ptillet this may be related to #574 as they are both issues with 1xN or Nx1 blocks. This issue is forcing us to generate worse/slower code (turning Nx1 blocks into...

> Yes, this was actually a very minor issue. I have it fixed, but will merge along with the atomic_add and the rand constexpr fix tonight probably :)

Awesome! Thanks...

I believe so, but I'll let @pyjhzwh confirm. This one is pretty awkward to work around, because there are a few different ways to write a broadcasting load: 1) reshape the index...

I'm hitting the same thing on CUDA 11.6 with the latest master, so it's not CUDA 11.4 specific. A clean build doesn't seem to help.