Improve upon EFA versions script
This script adds the Libfabric, NVIDIA driver, and CUDA versions to the output. It covers everything in efa-versions.sh, so I removed that script.
$ srun python3 efa-versions.py
+--------------------------+--------------+
| Package | Version |
+--------------------------+--------------+
| EFA installer version: | 1.26.1 |
+--------------------------+--------------+
| NCCL Version | 2.18.5 |
+--------------------------+--------------+
| Libfabric Version | 1.18.2 |
+--------------------------+--------------+
| AWS OFI NCCL version: | 1.7.3-aws |
+--------------------------+--------------+
| Nvidia Driver | 535.104.12 |
+--------------------------+--------------+
| CUDA Version: | 12.1.105 |
+--------------------------+--------------+
And with a container image:
$ srun python3 efa-versions.py --container-image megatron-training
+--------------------------+--------------+--------------+
| Package | Local | Container |
+--------------------------+--------------+--------------+
| EFA installer version: | 1.26.1 | 1.30.0 |
+--------------------------+--------------+--------------+
| NCCL Version | 2.18.5 | None |
+--------------------------+--------------+--------------+
| Libfabric Version | 1.18.2 | 1.19.0 |
+--------------------------+--------------+--------------+
| AWS OFI NCCL version: | 1.7.3-aws | None |
+--------------------------+--------------+--------------+
| Nvidia Driver | 535.104.12 | 535.104.12 |
+--------------------------+--------------+--------------+
| CUDA Version: | 12.1.105 | 12.2.128 |
+--------------------------+--------------+--------------+
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Not ready to merge.
@sean-smith ready now?
@sean-smith when I try to run it, I get:
ubuntu@ip-10-1-22-213:~$ python3 check-efa.py
Traceback (most recent call last):
File "check-efa.py", line 8, in <module>
from prettytable import PrettyTable
ModuleNotFoundError: No module named 'prettytable'
Can we run without prettytable?
No, you just need to:
sudo apt install python3.8-venv
python3 -m venv venv && source venv/bin/activate
pip install prettytable
python3 efa-versions.py
If a customer wants to run this on a compute node (which they likely will), the package has to be installed on the compute node, which is suboptimal. Are there alternatives to prettytable that we can use without installing a package?
It's a little more nuanced than that: the customer will set up their virtualenv on the headnode, in the FSx Lustre filesystem, and then use that virtualenv from the compute nodes.
sudo apt install python3.8-venv                  # installs on headnode
python3 -m venv venv && source venv/bin/activate # installs on headnode
pip install prettytable                          # installs on headnode fsx
srun python3 efa-versions.py                     # runs on compute
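On the question above about alternatives to prettytable: a similar grid can be drawn with the Python standard library alone. The following is only a minimal sketch, not what efa-versions.py does; `render_table` is a hypothetical helper name, and cells are left-aligned rather than centered.

```python
def render_table(headers, rows):
    """Render headers and rows as an ASCII grid using only the standard library."""
    # Stringify every cell so values such as None are handled uniformly.
    table = [[str(cell) for cell in row] for row in [headers, *rows]]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [border]
    for row in table:
        cells = " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append("| " + cells + " |")
        lines.append(border)  # separator after every row, as in the output above
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample values copied from the output shown in this PR.
    print(render_table(
        ["Package", "Local", "Container"],
        [["EFA installer version", "1.26.1", "1.30.0"],
         ["NCCL Version", "2.18.5", None]],
    ))
```

A helper like this would remove the need for a virtualenv on the compute nodes, at the cost of maintaining a few lines of formatting code in the script.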