Leo Fang
For the CPU-GPU mutex: in this recipe I showed that it's possible to do a per-recipe mutex: https://github.com/nsls-ii-forge/qulacs-feedstock/blob/b8afd0011e9412703800b1c69f8b85eff00cffd5/recipe/meta.yaml#L56-L63 It's similar to the `mpi` mutex, except that it's done on a per-recipe...
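A minimal sketch of the pattern (package and variable names here are hypothetical; see the linked meta.yaml for the real recipe): one output keeps the same name and version across variants but carries a different build string, so the solver can never install the CPU and GPU variants side by side:

```yaml
# meta.yaml (abridged sketch, hypothetical names) -- per-recipe mutex
outputs:
  # The mutex: same name/version in both variants, but a different build
  # string, so the "cpu" and "gpu" builds can never be co-installed.
  - name: mypkg-mutex
    version: "1.0"
    build:
      string: {{ mutex_variant }}        # "cpu" or "gpu"
  # The real package pins the mutex to one variant via its build string.
  - name: mypkg
    requirements:
      run:
        - mypkg-mutex 1.0 {{ mutex_variant }}
```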
conda-forge CUDA packages are self-contained. You don't need to do `module load` or anything like that to set up a CUDA environment. As long as all compute nodes have a...
`CONDA_OVERRIDE_CUDA` allows you to pretend you have a different driver version (which is the basis of the `__cuda` virtual package) from the one installed on the system. It may or may...
Or, even worse, when you don't have a physical GPU or the driver installed at all. So yes, in this case it makes sense to use the env var to build an environment.
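The precedence can be sketched like this (a sketch of the documented behavior only, not conda's actual code; `effective_cuda_version` is a hypothetical helper name):

```python
import os
from typing import Optional

def effective_cuda_version(detected: Optional[str]) -> Optional[str]:
    """Version advertised for the ``__cuda`` virtual package.

    Sketch of the documented precedence: ``CONDA_OVERRIDE_CUDA`` wins over
    the driver version detected on the system, and setting it to an empty
    string hides ``__cuda`` entirely.
    """
    override = os.environ.get("CONDA_OVERRIDE_CUDA")
    if override is not None:
        return override.strip() or None  # "" means: no __cuda at all
    return detected
```

So on a driver-less build machine, `CONDA_OVERRIDE_CUDA=12.0 conda install ...` makes the solver behave as if a CUDA 12.0 driver were present.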
Note to self: This is a known perf weakness that the upstream will address in https://github.com/NVIDIA/cub/pull/578.
> Note to self: This is a known perf weakness that the upstream will address in [NVIDIA/cub#578](https://github.com/NVIDIA/cub/pull/578).

@jrhemstad @gevtushenko Is https://github.com/NVIDIA/cccl/pull/3969 enough to fix this perf issue, or do we need...
@srinivasyadav18 told me offline that the new fixed-size CUB reduction should be all we need. Setting this to blocked so that I remember to circle back after updating the CCCL...
This is exactly how `__cuda` is implemented: https://github.com/conda/conda/blob/5ebfc3e7cf4511794fa352d183062e5147d808d8/conda/plugins/virtual_packages/cuda.py#L108

Calling `nvidia-smi` is slightly worse:
- under the hood, it still loads `libcuda` and other DSOs
- you need a subprocess which...
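A rough sketch of the driver-API approach (Linux sonames assumed; conda's real implementation differs in details, e.g. it isolates the driver call from the parent process):

```python
import ctypes

def detect_cuda_version():
    """Ask the driver for its max supported CUDA version via libcuda.

    Returns e.g. "12.2", or None when no usable driver is installed.
    """
    # Try common Linux sonames for the driver library (assumption).
    for soname in ("libcuda.so", "libcuda.so.1"):
        try:
            libcuda = ctypes.CDLL(soname)
            break
        except OSError:
            continue
    else:
        return None  # no driver library found

    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return None
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    # cuDriverGetVersion encodes 1000*major + 10*minor, e.g. 12020 -> 12.2
    return f"{version.value // 1000}.{(version.value % 1000) // 10}"
```

No subprocess, no `nvidia-smi`: it loads `libcuda` directly and makes exactly the two driver calls needed.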
FWIW CUDA dropped ppc64le support entirely by CUDA 12.5. In my (personal) opinion the maintenance overhead of ppc packages on conda-forge is too high and we should drop it too...
Hi, could you please push your code to the same branch so that all the changes are contained in a single PR, instead of creating multiple PRs?