
[feature request] accelerate launcher: add numa affinities control

Open stas00 opened this issue 1 year ago • 20 comments

Feature request

As explained here: https://github.com/pytorch/pytorch/issues/115305 - when using 2-CPU nodes it's important to get the NUMA affinities right to avoid cross-NUMA-node traffic.

As torchrun currently doesn't support it, a workaround was posted here: https://github.com/pytorch/pytorch/issues/115305#issuecomment-1845957682 - it relies on the torchrun flag --no-python, which the accelerate launcher doesn't have.

So, any suggestions on how I could use this script with accelerate?

For simplicity here is the solution for torchrun:

trampoline.sh

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
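# Lowercase the ID (${BUS_ID,,}); sysfs paths use lowercase hex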
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
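# (${BUS_ID:4} drops the first 4 digits of the PCI domain, e.g. 00000000:1a:00.0 -> 0000:1a:00.0, to match the /sys path)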
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

and the torchrun invocation:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

Update: I shared @yifuwang's workaround at https://twitter.com/StasBekman/status/1734724979496018215

If a pynvml dependency is OK, someone posted a Python solution: https://github.com/NVIDIA/DeepLearningExamples/blob/9dd9fcb98f56187e49c5ee280cf8dbd530dde57b/TensorFlow2/LanguageModeling/BERT/gpu_affinity.py - that would probably be easier to integrate into the launcher.
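
For reference, a minimal sketch of what that approach boils down to (an illustration, not the linked NVIDIA script verbatim; it binds CPU affinity only - numactl's --membind would additionally bind memory allocations):

import os

import pynvml

def set_numa_affinity_for_gpu(gpu_index):
    """Pin the current process to the CPUs of the NUMA node hosting gpu_index."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId
    if isinstance(bus_id, bytes):  # older pynvml versions return bytes
        bus_id = bus_id.decode()
    # NVML reports e.g. "00000000:1A:00.0"; sysfs wants "0000:1a:00.0"
    bus_id = bus_id.lower()[4:]
    with open(f"/sys/bus/pci/devices/{bus_id}/numa_node") as f:
        node = int(f.read().strip())
    if node < 0:  # sysfs reports -1 on single-NUMA-node systems
        return
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()  # e.g. "0-23,48-71"
    cpus = set()
    for part in cpulist.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)  # CPU binding only, no --membind equivalent

Each rank would call it with int(os.environ["LOCAL_RANK"]) before initializing CUDA.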

Thanks.

stas00 avatar Dec 13 '23 00:12 stas00

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 12 '24 15:01 github-actions[bot]

keepalive

stas00 avatar Jan 12 '24 18:01 stas00

keepalive

stas00 avatar Feb 07 '24 16:02 stas00

Will finally be looking into adding it this week :)

muellerzr avatar Mar 03 '24 15:03 muellerzr

that's exciting, Zach - thank you!

stas00 avatar Mar 04 '24 05:03 stas00

@stas00 actually wait, can't we just do this?

accelerate launch --no_python --multi_gpu --num_processes 8 --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

(I just pulled all this from accelerate launch -h)

Or am I missing something?...

Or, with a config file:

accelerate launch --no_python --monitor_interval=1 ./trampoline.sh python3 -c "print('hello')"

muellerzr avatar Mar 04 '24 15:03 muellerzr

Note I'll also be making a PR to let you use both - and _ for params, since that seems to be the root cause of a lot of confusion with our CLI 😓

Edit: https://github.com/huggingface/accelerate/pull/2525 will make it possible to do this OOTB with no arg fixes, as the root cause was - vs _

muellerzr avatar Mar 04 '24 16:03 muellerzr

If it works, then fantastic - let's document this, please, and ideally move --no_python to the end, just before the no-python code, as it's easier to comprehend the connection (IMHO). That is:

accelerate launch --multi_gpu --num_processes 8 --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"
accelerate launch --monitor_interval=1 --no_python ./trampoline.sh python3 -c "print('hello')"

stas00 avatar Mar 04 '24 20:03 stas00

Zach, I think the trouble with the workaround solution is that the user won't have trampoline.sh and they would have to get it from somewhere for each new setup.

I think a much better solution would be for the framework to have an independent solution that is provided by its core.

stas00 avatar Mar 04 '24 23:03 stas00

Sure @stas00, I can agree on that, and we can probably extend launch to help. I'm unfamiliar with numactl, so any assistance in explaining it so I can wrap my head around what's happening here would help. Let's stick with torchrun.

So we have the bash script:

#!/usr/bin/bash

# Query the bus ID for device LOCAL_RANK
BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader)
BUS_ID=${BUS_ID,,}

# Find the numa node for device LOCAL_RANK
NODE=$(cat /sys/bus/pci/devices/${BUS_ID:4}/numa_node)

echo "Starting local rank $RANK on numa node $NODE"
numactl --cpunodebind=$NODE --membind=$NODE "$@"

And the command:

torchrun --nproc_per_node=8 --monitor-interval=1 --no-python ./trampoline.sh python3 -c "print('hello')"

So torchrun will execute trampoline.sh, and we pass the python3 part on to numactl via the "$@" there.

Or, assuming the environment is set up properly, we are effectively running:

torchrun \
  --nproc_per_node=8 \
  --monitor-interval=1 \
  --no-python \
  numactl \
  --cpunodebind=$NODE \
  --membind=$NODE \
  python3 -c "print('hello')"

Is this a valid understanding of what we have going on?

muellerzr avatar Mar 05 '24 00:03 muellerzr

Tbh though, the pynvml solution makes more sense; we can add it as a CLI option and just raise an error if it's not installed. Let me work on that real quick.
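
(Something like this hypothetical guard - the function name and message are illustrative, not the actual PR code:)

def require_pynvml():
    # Fail early with a clear message if the optional dependency is missing
    try:
        import pynvml  # noqa: F401
    except ImportError:
        raise ImportError(
            "NUMA affinity support requires pynvml; install it with `pip install pynvml`."
        )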

muellerzr avatar Mar 05 '24 01:03 muellerzr

Is this a valid understanding of what we have going on?

That looks correct. The NODE is the NUMA node, of course.

And you need to check that numactl exists, as it's not normally installed on Linux - it needs to be installed separately.

re: pynvml

  • I'm pretty sure it's NVIDIA-only - i.e. it won't work on anything but NVIDIA. As AMD MI300X and Intel Gaudi2 are emerging, this solution won't work there, but numactl will
  • beware that pynvml ignores CUDA_VISIBLE_DEVICES - where I use it I had to apply a workaround (a minimal illustration follows this list): https://github.com/stas00/ipyexperiments/blob/569a450e204b1da60e5ef07a96c91553b286ea14/ipyexperiments/utils/mem.py#L33-L44
  • and is Accelerate Linux-only? To work on OSX I had to do https://github.com/stas00/ipyexperiments/blob/569a450e204b1da60e5ef07a96c91553b286ea14/ipyexperiments/utils/pynvml_gate.py#L5-L15
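
For the CUDA_VISIBLE_DEVICES point, the linked workaround boils down to remapping indices before calling pynvml - a minimal sketch, assuming CUDA_VISIBLE_DEVICES holds plain integer indices:

import os

def pynvml_device_index(torch_index):
    # torch renumbers devices according to CUDA_VISIBLE_DEVICES, but pynvml
    # always uses physical indices, so remap before calling pynvml
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return torch_index
    return int(visible.split(",")[torch_index])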

stas00 avatar Mar 05 '24 01:03 stas00

It is not (Linux-only) - looks like we'll need to do it the hard way without pynvml (and just run a series of bash things), given that.

muellerzr avatar Mar 05 '24 01:03 muellerzr

But surely what the bash script does can be done in Python, and Python has the NUMA API functionality - that's basically the pynvml script w/o the pynvml code in it.

So I think it'd be much cleaner and more user-friendly not to do it at the CLI level.

stas00 avatar Mar 05 '24 01:03 stas00

re-checked, it's NVIDIA only:

pynvml - Python bindings to the NVIDIA Management Library.

stas00 avatar Mar 05 '24 01:03 stas00

No worries, while un-fun, I'm getting it working with some subprocess ;)
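
(For the curious, the subprocess route amounts to replicating the bash trampoline in Python - a rough sketch with illustrative names, not the actual implementation:)

import subprocess

def numa_node_for_gpu(local_rank):
    # Same lookup as the bash script, minus the pynvml dependency
    bus_id = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "-i", str(local_rank),
         "--format=csv,noheader"],
        text=True,
    ).strip().lower()
    # Drop the first four characters of the PCI domain (as ${BUS_ID:4} does)
    # so the ID matches the /sys path
    with open(f"/sys/bus/pci/devices/{bus_id[4:]}/numa_node") as f:
        return int(f.read().strip())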

muellerzr avatar Mar 05 '24 01:03 muellerzr

@stas00 if you want to try some bleeding-edge stuff, I just pushed some commits. I haven't fully tested it on a multi-GPU system yet, but at least the dry run of the commands looks like everything should have been set up properly:

pip install git+https://github.com/huggingface/accelerate@affinity

To use:

accelerate launch --multi_gpu --num_processes 8 --enable_numa_affinity myscript.py --arg1=1 ...

I'll be able to fully test in the AM :) (and enable it via config file, etc, etc)

muellerzr avatar Mar 05 '24 02:03 muellerzr

hmm, I wasn't paying attention to the bash callouts - but looking closely now, it's still NVIDIA-dependent because of BUS_ID=$(nvidia-smi --query-gpu=pci.bus_id -i $LOCAL_RANK --format=csv,noheader) - so for AMD it'd be rocm-smi (need to check the args), and for gaudi2 I don't know - we should ask the optimum folks.

So the subprocess callout is probably simpler anyway than making accelerate depend on pynvml and various other libraries, one per architecture. That might be inevitable in the long run as the number of vendors explodes, but for now it's probably easier to first figure out from pytorch which vendor it is and then "switch" into the corresponding bus-ID code.
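
That vendor switch could start out as small as this sketch (illustrative only; torch.version.hip is set on ROCm builds of pytorch and None on CUDA builds, and Gaudi detection would still need its own path):

import torch

def gpu_vendor():
    # ROCm builds of pytorch expose torch.version.hip; CUDA builds set it to None
    if torch.cuda.is_available():
        return "amd" if torch.version.hip else "nvidia"
    return "unknown"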

stas00 avatar Mar 05 '24 16:03 stas00

Let's start small with the NVIDIA version, then we can add AMD and gaudi2 as follow-ups. (Since we can only test the nvidia-smi version rn.)

muellerzr avatar Mar 07 '24 12:03 muellerzr

@stas00 please see https://github.com/huggingface/accelerate/pull/2535 :)

muellerzr avatar Mar 07 '24 15:03 muellerzr