
Help on MultiGPU setup

hahuyhoang411 opened this issue on Dec 28 '23

I tried this first but got OOM:

accelerate launch --use_fsdp --no_python lm_eval --model hf \
    --model_args pretrained=dillfrescott/trinity-v1.2-x8-MoE,load_in_4bit=True \
    --tasks ai2_arc \
    --batch_size 4 \
    --num_fewshot 25

Then I saw a suggested solution using parallelize=True:

lm_eval --model hf --batch_size 1 --model_args pretrained=dillfrescott/trinity-v1.2-x8-MoE,load_in_4bit=True,parallelize=True --tasks arc_easy --num_fewshot 25

But I got an error like this:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

hahuyhoang411, Dec 28 '23

Hi! What is the codebase commit you are running this on? And does using --device cuda change this behavior?

If this bug still persists on the current codebase, then it should be a quick fix to handle this properly.

haileyschoelkopf, Dec 29 '23

I am experiencing the same error, and --device cuda does not seem to work!

guijinSON, Jan 05 '24

I am experiencing the same error +1

yuleiqin, Jan 05 '24

@guijinSON @yuleiqin what model are you experiencing this with? I seem to have trouble replicating this using

accelerate launch -m lm_eval --model hf --tasks lambada_openai --model_args pretrained=EleutherAI/llemma_7b,trust_remote_code=True,load_in_4bit=true --device cuda:0

with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

trying on a Mixtral model now.

EDIT: can't seem to replicate on

lm_eval --model hf --batch_size auto --model_args pretrained=mistralai/Mixtral-8x7B-v0.1,load_in_4bit=True,parallelize=True --tasks arc_easy --num_fewshot 25 --limit 25

haileyschoelkopf, Jan 05 '24

I was using qwen-72b

guijinSON, Jan 05 '24

@guijinSON @hahuyhoang411 I think the problem is perhaps related to accelerate. For large LLMs, if we just use four GPU cards without the accelerate command, transformers' AutoModelForCausalLM.from_pretrained can automatically load a big model (e.g., 70B) and assign the weights across the GPU cards (e.g., with device_map="auto"). In this case, no OOM occurs. However, if we use accelerate, it seems that the model is first loaded onto the CPU (I do not see any increase in GPU memory while the different shards are being loaded). After the model is loaded, it then assigns all weights to all 4 GPUs and OOM occurs. I believe something goes wrong here: the weights are NOT split and assigned properly across the 4 cards, but instead a full copy is placed on each card.
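
For reference, a minimal sketch of the loading behaviour described above, assuming a recent transformers release with accelerate installed; the model name is only an illustrative placeholder:

from transformers import AutoModelForCausalLM

# device_map="auto" shards the checkpoint across the visible GPUs at load time
# instead of materialising a full copy on every card.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # illustrative placeholder for a large model
    device_map="auto",
    torch_dtype="auto",
)
print(model.hf_device_map)  # shows which layers were placed on which device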

yuleiqin, Jan 06 '24

I think there are a few issues being conflated here and it would be helpful to disentangle them:

We support:

  • Launching with accelerate launch, which is only meant to support data-parallel inference (no FSDP, no splitting a model across multiple GPUs).
  • Launching without accelerate launch but with --model_args parallelize=True, which is meant to enable loading a single copy of the model, split across all GPUs you have available.

For all the use cases you are describing, the latter option is the one to use. However, my understanding is that when trying to use parallelize=True, the result is RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

However, I'm struggling to reproduce this error on the most recent version of the codebase; I will continue to see if I can find a way to replicate it.
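
As a concrete illustration of the second option, here is a minimal sketch using the Python API (assuming lm-eval v0.4.x, where simple_evaluate and the parallelize flag are available); it is equivalent to the parallelize=True CLI calls shown earlier in this thread:

import lm_eval

# One copy of the model, split across all visible GPUs (model parallelism);
# no accelerate launch involved.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mixtral-8x7B-v0.1,load_in_4bit=True,parallelize=True",
    tasks=["arc_easy"],
    num_fewshot=25,
    batch_size=1,
)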

haileyschoelkopf, Jan 08 '24

@haileyschoelkopf Thank you for the clarification; running with --model_args parallelize=True works for me. It was this description in the README that confused me:

If your model is too large to be run on a single one of your GPUs then you can use accelerate with Fully Sharded Data Parallel (FSDP) that splits the weights of the model across your data parallel ranks. To enable this, ensure you select YES when asked Do you want to use FullyShardedDataParallel? when running accelerate config. To enable memory-efficient loading, select YES when asked Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start?.

guijinSON, Jan 09 '24

Thank you, that makes sense! I've amended the documentation to be less confusing here; I hope that it's helpful.

The quoted paragraph there refers only to using FSDP with NO_SHARD: an advanced option that lets you avoid the huge surge in required CPU RAM that results from loading a copy of the model from CPU into GPU on initialization when doing data-parallel evaluation.

@hahuyhoang411 @yuleiqin Does the linked docs change in #1261 clear anything up? Does the issue persist when using either parallelize=True or accelerate launch alone, with FSDP disabled in the accelerate config?

haileyschoelkopf, Jan 09 '24

Hi, I think I am experiencing the same error when using parallelize=True. I tried both the 0.4.0 and 0.4.1 versions of the codebase and they behaved the same.

My command was lm_eval --model "hf" --model_args pretrained=/home/xyf/workspace/pythia --task winogrande --batch_size 8 --num_fewshot 5 --verbosity "DEBUG" --output_path outputs --device cuda, trying to do model-parallel evaluation to load a single copy of a model that is too big to fit on a single GPU. My GPU environment was 4x A800 (80 GB). The result is:

  File "/home/fanyuxuan/anaconda3/envs/eval-vanilla/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

which occurred while running loglikelihood requests.
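
For context, here is a standalone sketch (not lm-eval code, and only an assumption about the cause) of how this kind of embedding device mismatch can arise when a model is split across GPUs with device_map="auto": the embedding weights can land on a different GPU than the input ids, and the inputs have to be moved to the model's first device before the forward pass. The checkpoint name is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-2.8b"  # placeholder; the report above used a local Pythia checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

batch = tokenizer("The quick brown fox", return_tensors="pt")  # tensors start on CPU
batch = {k: v.to(model.device) for k, v in batch.items()}      # move to the device of the model's first parameters
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)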

feiba54, Mar 14 '24