Improve upon EFA versions script

Open sean-smith opened this issue 1 year ago • 7 comments

This script adds the Libfabric, NVIDIA driver, and CUDA versions to the report. It covers everything in efa-versions.sh, so I removed that script.

$ srun python3 efa-versions.py
+--------------------------+--------------+
|  Package                 |  Version     |
+--------------------------+--------------+
|  EFA installer version:  |  1.26.1      |
+--------------------------+--------------+
|  NCCL Version            |  2.18.5      |
+--------------------------+--------------+
|  Libfabric Version       |  1.18.2      |
+--------------------------+--------------+
|  AWS OFI NCCL version:   |  1.7.3-aws   |
+--------------------------+--------------+
|  Nvidia Driver           |  535.104.12  |
+--------------------------+--------------+
|  CUDA Version:           |  12.1.105    |
+--------------------------+--------------+

And with a container image:

$ srun python3 efa-versions.py --container-image megatron-training
+--------------------------+--------------+--------------+
|  Package                 |  Local       |  Container   |
+--------------------------+--------------+--------------+
|  EFA installer version:  |  1.26.1      |  1.30.0      |
+--------------------------+--------------+--------------+
|  NCCL Version            |  2.18.5      |  None        |
+--------------------------+--------------+--------------+
|  Libfabric Version       |  1.18.2      |  1.19.0      |
+--------------------------+--------------+--------------+
|  AWS OFI NCCL version:   |  1.7.3-aws   |  None        |
+--------------------------+--------------+--------------+
|  Nvidia Driver           |  535.104.12  |  535.104.12  |
+--------------------------+--------------+--------------+
|  CUDA Version:           |  12.1.105    |  12.2.128    |
+--------------------------+--------------+--------------+

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sean-smith avatar Apr 15 '24 17:04 sean-smith

Not ready to merge.

sean-smith avatar Apr 16 '24 11:04 sean-smith

@sean-smith ready now?

perifaws avatar Apr 29 '24 16:04 perifaws

@sean-smith when I try to run it, I get:

ubuntu@ip-10-1-22-213:~$ python3 check-efa.py
Traceback (most recent call last):
  File "check-efa.py", line 8, in <module>
    from prettytable import PrettyTable
ModuleNotFoundError: No module named 'prettytable'

nghtm avatar May 01 '24 16:05 nghtm

Can we run without prettytable?

nghtm avatar May 01 '24 16:05 nghtm

Can we run without prettytable?

No, you just need to run:

sudo apt install python3.8-venv
python3 -m venv venv && source venv/bin/activate
pip install prettytable
python3 efa-versions.py

sean-smith avatar May 01 '24 17:05 sean-smith

If a customer wants to run this on a compute node (which they likely will), the package has to be installed on the compute node, which is suboptimal. Are there alternatives to prettytable that we can use without installing a package?

nghtm avatar May 02 '24 14:05 nghtm

If a customer wants to run this on a compute node (which they likely will), the package has to be installed on the compute node, which is suboptimal. Are there alternatives to prettytable that we can use without installing a package?

It's a little more nuanced than that: the customer sets up their virtualenv on the head node, inside the FSx for Lustre filesystem, and then uses that virtualenv from the compute nodes.

sudo apt install python3.8-venv  # installs on the head node
python3 -m venv venv && source venv/bin/activate  # creates and activates the venv (on FSx)
pip install prettytable  # installs into the venv on FSx
srun python3 efa-versions.py  # runs on a compute node

sean-smith avatar May 03 '24 09:05 sean-smith
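
On the question above about avoiding the prettytable dependency: one possible option is to render the table with the standard library only. The sketch below is not taken from efa-versions.py. The function name `render_table` and its `pad` parameter are hypothetical, and the layout (two spaces of padding, a rule after every row) is inferred from the sample output earlier in this thread.

```python
def render_table(headers, rows, pad=2):
    """Render an ASCII grid similar to the output above, stdlib only.

    Hypothetical helper for illustration; not part of efa-versions.py.
    """
    # Normalize everything to strings so None prints as "None".
    table = [[str(h) for h in headers]]
    table += [[str(cell) for cell in row] for row in rows]

    # Each column is as wide as its longest cell.
    widths = [max(len(r[i]) for r in table) for i in range(len(headers))]

    # Horizontal rule, e.g. +------+------+
    rule = "+" + "+".join("-" * (w + 2 * pad) for w in widths) + "+"

    lines = [rule]
    for r in table:
        cells = [" " * pad + c.ljust(w) + " " * pad for c, w in zip(r, widths)]
        lines.append("|" + "|".join(cells) + "|")
        lines.append(rule)  # rule after every row, as in the sample output
    return "\n".join(lines)


if __name__ == "__main__":
    # Values copied from the sample output in this thread.
    print(render_table(
        ["Package", "Version"],
        [
            ["EFA installer version:", "1.26.1"],
            ["Nvidia Driver", "535.104.12"],
        ],
    ))
```

With something like this the script could run on compute nodes without a virtualenv, at the cost of maintaining a few lines of formatting code instead of relying on the library.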