Verdi March

Results: 21 comments by Verdi March

OK, in that case I can propose these options going forward. 1/ Merge this PR as is, as the most cautious approach to eliminate the errors. 2/ Make this update optional...

Here's how I repro the issue. GPU driver version: 535.161.08. Docker version: 26.1.1, build 4cf5afa. Run docker (unprivileged): `docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --device=/dev/infiniband --device=/dev/gdrdrv -it...

To rebase once #314 is merged.

> I'd like to see this give more structured data, i.e. allow it to be used within other Python scripts

My quick 2c: for this purpose, it's best to re-implement...

This template provides a collection of recipes. It's not meant to be plug-and-play, but to be cherry-picked from. It's also intentionally very verbose, and it's up to the adopter to tone it down. On...

Need PyTorch script. It serves a different purpose than efa-versions.sh (which probes what's available on disk). Rather, it should use the libraries that PyTorch script will...

Here's the difference between what PyTorch actually uses versus what the AMI pre-installs system-wide. My DLAMI provides cuda-12.1 (default) with nccl-2.18.5. Simply probing what's installed system-wide is useful (or...
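
To illustrate the distinction, here is a minimal sketch of probing the versions that the importable PyTorch build reports, as opposed to what is installed on disk. The function name `probe_pytorch_versions` is hypothetical and not part of any script discussed here. It assumes only that `torch.version.cuda` and `torch.cuda.nccl.version()` are available, and it falls back to `None` when PyTorch or its NCCL bindings are missing.

```python
import importlib


def probe_pytorch_versions():
    """Report the CUDA and NCCL versions seen by the importable PyTorch build.

    Hypothetical helper for illustration; values are None when unavailable.
    """
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # PyTorch is not installed, or its shared libraries failed to load.
        return {"torch": None, "cuda": None, "nccl": None}

    info = {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None on CPU-only builds).
        "cuda": torch.version.cuda,
        "nccl": None,
    }
    try:
        # NCCL version PyTorch is linked with; may fail on CPU-only builds.
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        pass
    return info


if __name__ == "__main__":
    for name, value in probe_pytorch_versions().items():
        print(f"{name}: {value}")
```

Comparing this output with a system-wide probe would show whether the PyTorch build bundles its own NCCL rather than using the one pre-installed on the AMI.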

In practice, does this happen only for certain PyTorch builds? Has it ever happened with nccl-tests?