Leo Fang

Results: 1175 comments by Leo Fang

Accessing `Device().compute_capability` is being addressed in #459. Let me re-label this issue to track the remaining binding performance issue.

@rwgk reported that `cuDriverGetVersion` is also sluggish when called repeatedly in a busy loop.

Yes, see https://github.com/NVIDIA/cuda-python/issues/439#issuecomment-2673234572. Right now the problem is in cuda.bindings, not cuda.core. I have changed the issue label to reflect this status.

Wow! Great findings, Vlad! It is insane how slow `IntEnum` (or any `Enum` subclass from the standard library) is... I wonder if it makes sense to build an internal cache ourselves?...

(Your fast path is also reasonable FWIW, I just wonder whether this is worth our effort.)
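To make the "internal cache" idea above concrete, here is a minimal sketch, not the actual cuda.bindings code. `Status`, `_STATUS_CACHE`, `to_status_slow`, `to_status_cached`, and `compare` are all hypothetical names invented for illustration. It assumes the cost being discussed is the `Enum` call machinery that runs on every `Status(code)` conversion, and replaces it with a plain dict lookup built once at import time.

```python
import enum
import timeit


class Status(enum.IntEnum):
    """Hypothetical stand-in for an enum generated by the bindings."""
    SUCCESS = 0
    INVALID_VALUE = 1
    OUT_OF_MEMORY = 2


# Value -> member table, built once. Members are singletons, so the
# cached objects are identical to what Status(code) would return.
_STATUS_CACHE = {member.value: member for member in Status}


def to_status_slow(code):
    # Goes through the Enum metaclass call path on every conversion.
    return Status(code)


def to_status_cached(code):
    # Fast path: a single dict lookup for known values.
    try:
        return _STATUS_CACHE[code]
    except KeyError:
        # Unknown values fall back to the enum, which raises ValueError.
        return Status(code)


def compare(number=100_000):
    """Return (slow_seconds, cached_seconds) for `number` conversions each."""
    slow = timeit.timeit(lambda: to_status_slow(1), number=number)
    cached = timeit.timeit(lambda: to_status_cached(1), number=number)
    return slow, cached
```

Calling `compare()` gives a rough before/after comparison on a given interpreter. The numbers depend heavily on the Python version, so they say nothing by themselves about whether the extra code is worth maintaining.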

My take from comparing version 1 and version 2 is that we wasted 100% overhead (60->120ns) just to create a tuple... We may want to think seriously about breaking the...

> My take from comparing version 1 and version 2 is that we wasted 100% overhead (60->120ns) just to create a tuple...

I read it wrong. Creating the return tuple...

> Build https://github.com/cupy/cupy/pull/8412 from source

To unblock myself I've force-pushed back to the snapshot from yesterday, but starting from this commit https://github.com/cupy/cupy/commit/08e6a3c63fe734c259a83f76381ed973d99cd7bd it should be reproducible. By tracing the code...

Hi Bernhard, thanks for your reply.

> which relies on `thrust::less` having an actual `operator()`. We changed this in https://github.com/NVIDIA/cccl/pull/1872
> ...
> (we cannot do this yet, because we...

I had a somewhat lengthy discussion with Jake offline. Below is a summary of what I asked regarding specializing `thrust::less` vs `thrust::less::operator` noted above (and the offline thread), for posterity:...