Natalia Gimelshein

214 comments by Natalia Gimelshein

I understand (although 1.10 didn't implement scipy exactly, it was still using `sum`s, not means), but currently we traded one set of inputs for another set of inputs, so it's...

But still, maybe at the end of the day we should just clamp outs so that they are no bigger than one. Gradient computation is a different matter, so still...
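For illustration only, the clamping idea can be sketched in plain Python. This assumes the outputs are values that are mathematically bounded by one but can overshoot slightly through rounding; `clamp_upper` and the sample numbers are hypothetical and not taken from the thread.

```python
def clamp_upper(values, upper=1.0):
    """Cap each value at `upper`; values at or below it pass through unchanged."""
    return [min(v, upper) for v in values]


# Hypothetical outputs: the third one overshoots 1.0 by a rounding-sized amount.
outs = [0.5, 1.0, 1.0000001, -1.0]
clamped = clamp_upper(outs)
```

On tensors the same effect is available through `torch.clamp(x, max=1.0)`. As the comment notes, this only addresses the forward values; the gradient is a separate question.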

```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDAToolkit: /usr/local/cuda-12.8/include (found version "12.8.93")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
--...
```

cudss is not the problem; misdetecting the architecture (sm86+PTX instead of sm90) is.

Concretely, this is wrong in cmake
```
if(CUDA_LIMIT_GPU_ARCHITECTURE AND ITEM VERSION_GREATER_EQUAL CUDA_LIMIT_GPU_ARCHITECTURE)
  list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
  string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${NEWITEM}")
else()
  string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${ITEM}")
endif()
```
as it either...
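A rough Python model of that branch shows how a detected architecture can be swapped for the last entry of the common list. The helper names and the input values (`"9.0"` detected, a limit of `"9.0"`, a common list ending in `"8.6+PTX"`) are assumptions picked to mirror the sm86+PTX symptom; the actual variable contents are not shown in the thread.

```python
def version_tuple(arch):
    """Turn '8.6+PTX' or '9.0' into a comparable tuple such as (8, 6)."""
    return tuple(int(part) for part in arch.split("+")[0].split("."))


def filter_detected(detected, common_archs, limit=None):
    """Mimic the cmake branch: any item >= limit is replaced by common_archs[-1]."""
    filtered = []
    for item in detected:
        if limit and version_tuple(item) >= version_tuple(limit):
            # Mirrors list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
            filtered.append(common_archs[-1])
        else:
            filtered.append(item)
    return filtered


# Hypothetical inputs, not taken from a real build log.
common = ["8.0", "8.6+PTX"]
with_limit = filter_detected(["9.0"], common, limit="9.0")
without_limit = filter_detected(["9.0"], common)
```

With these assumed inputs the detected `9.0` survives only when no limit is set; once the limit applies, the result depends entirely on what the last common architecture happens to be.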

Do you know how this "native" option would work later when we are checking if the build is ok for the current GPU to give a clear error message on...

> @ngimel From nvcc documentation:
>
> ```
> When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will...
> ```

btw is anyone working on adding cooperative launch to triton? Without it grid sync is unsafe - the threadblocks may not all be resident on the gpu and it'll be...
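The residency concern can be sketched as a toy model. It assumes a spin-style grid barrier where every block must arrive before any may leave, and a GPU that can hold only a fixed number of blocks at once; the function names and the numbers are hypothetical.

```python
def max_resident_blocks(num_sms, blocks_per_sm):
    """Upper bound on blocks that can be on the GPU at the same time."""
    return num_sms * blocks_per_sm


def grid_barrier_outcome(grid_blocks, num_sms, blocks_per_sm):
    """Toy model: resident blocks spin at the barrier and never exit, so
    blocks still waiting for a free slot are never scheduled."""
    capacity = max_resident_blocks(num_sms, blocks_per_sm)
    arrived = min(grid_blocks, capacity)
    return "completes" if arrived == grid_blocks else "deadlocks"


# Hypothetical device: 4 SMs, 2 co-resident blocks each.
fits = grid_barrier_outcome(grid_blocks=8, num_sms=4, blocks_per_sm=2)
too_big = grid_barrier_outcome(grid_blocks=9, num_sms=4, blocks_per_sm=2)
```

In CUDA, a cooperative launch (`cudaLaunchCooperativeKernel`) rejects a grid that cannot be fully co-resident, which turns the silent hang in the second case into a launch error.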

I'd say cooperative launch is needed before this can be turned on by default (it could deadlock without it); `tl.load` with `acquire` is nice to have, but it's just a perf optimization

@jataylo so far this functionality is not turned on by default and is just exercised in tests, you might want to throw an error if it's manually turned on. The...