awsome-distributed-training
awsome-distributed-training copied to clipboard
PyTorch-based utility to recommend env vars.
Need a PyTorch script to probe instances and software stack, then prints the necessary EFA env vars. Use this script to avoid spending hours to debug runtime crash caused by "forgetting to set env var XXX when using a combination of libXXX-versionA
and libYYY-versionB
It can be a plain Python script, as long as it can accurately probe the version of nccl, aws-ofi-nccl (nccl-net), libfabric, and efa version (dlopen or what not). The probing mechanic must be robust to whether the script is run under container (vs) host, whether env vars are overriden by job script, etc.
At the very least, the script must covers the env vars in EFA cheatsheet, +new ones (like the cuda mem sync env var for incompatibility between specific versions of nccl and efa installer).