
Flex 170 x8 is failing when targeting 6 or 8 GPUs

Open gbertulf opened this issue 10 months ago • 7 comments

Observed this issue when attempting to run Llama2-7B 32x32 token inference on a Flex 170 x8 DUT. For reference, this DUT is accessible by following the instructions here -- Welcome to the ISE Lab - ISE Team Lab Operations - Intel Enterprise Wiki.

Current system config and specs are as listed here -- https://wiki.ith.intel.com/display/MediaWiki/Flex-170x8+%28Inspur+-+ICX%29+Qualification

Script used: a modified version of run_vicuna_33b_arc_2_card.sh (from https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/README.md):

(screenshot: modified run script)

The failure is seen in this portion (some overlap exists in the images):

(screenshots: failure output)

gbertulf avatar Apr 03 '24 17:04 gbertulf

Hi @gbertulf, you cannot run AutoTP if the attention head number is not divisible by the number of GPUs, as explained in the error message. In this case, the attention head number is 32 and the number of GPUs is 6.

yangw1234 avatar Apr 03 '24 20:04 yangw1234
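
For reference, a minimal sketch (not from the original thread) of checking this constraint before launching AutoTP; the model path is the one used in the mpirun command later in this thread, and the GPU count is the failing configuration:

# Minimal sketch: verify the AutoTP divisibility constraint before launching.
from transformers import AutoConfig

model_path = "/home/gta/glen/Llama-2-7b-chat-hf"
num_gpus = 6

cfg = AutoConfig.from_pretrained(model_path)
heads = cfg.num_attention_heads  # 32 for Llama2-7B
print(f"attention heads = {heads}, GPUs = {num_gpus}")
if heads % num_gpus != 0:
    raise ValueError(f"AutoTP requires heads ({heads}) to be divisible by the GPU count ({num_gpus})")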

Hi Yang, is there a workaround to get 6 or 8 GPUs working?

gbertulf avatar Apr 03 '24 22:04 gbertulf

The attention head number is model dependent; for 6 GPUs, you can try to find a model whose head number is a multiple of 6.

For 8 GPUs, head number 32 should work.

yangw1234 avatar Apr 03 '24 22:04 yangw1234
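
As an aside (a sketch, not from the original comment), the divisibility rule above means a 32-head model can only be split across GPU counts that divide 32:

def usable_gpu_counts(num_heads: int, max_gpus: int = 8):
    # GPU counts that evenly divide the attention head number
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]

print(usable_gpu_counts(32))  # [1, 2, 4, 8] -- 6 GPUs is not usable with 32 heads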

Hi Yang, going back to 8 GPUs on Flex with a 32 attention head number: I reran on the same platform and verified via print(model) that the attention head number is 32 (screenshot).

However, even though the attention head number (32) is divisible by 8 GPUs, the execution still failed (screenshot).

Wondering if there are any intermediate steps to do, or changes in config or settings to apply?

gbertulf avatar Apr 08 '24 20:04 gbertulf

Hi @gbertulf , would you mind providing the full output log in text format?

yangw1234 avatar Apr 08 '24 20:04 yangw1234

Hi Yang, please see the attached output text file -- 8GPUs_llama2_7B.txt

gbertulf avatar Apr 08 '24 20:04 gbertulf

@gbertulf, if you are loading the model in FP32, it could be that all 8 model replicas are loaded into CPU memory at the same time and there is not enough CPU memory. Can you check the total CPU memory size on your system and monitor CPU memory usage while running the application?

yangw1234 avatar Apr 09 '24 22:04 yangw1234
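
One way to verify this (a minimal sketch, not part of deepspeed_autotp_mod.py; assumes psutil is installed) is to log host memory around the model load on each rank:

import psutil

def log_mem(tag: str) -> None:
    # Print system-wide host memory usage at a named point in the program.
    vm = psutil.virtual_memory()
    print(f"[{tag}] used {vm.used / 1e9:.1f} GB / {vm.total / 1e9:.1f} GB ({vm.percent:.0f}%)", flush=True)

log_mem("before model load")
# model = AutoModelForCausalLM.from_pretrained(...)  # the load under test
log_mem("after model load")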

Hi Yang, updating.

The CPU memory does hit maximum utilization when running 8 GPUs with the Vicuna 33B model on the DUT --

(screenshot: CPU memory at maximum utilization)

Please see the attached Vicuna 33B full text output log -- 8GPUs_vicuna33B.txt

When running 8 GPUs with Llama2-7B + FP16, it does not hit maximum memory utilization. It is challenging to capture a screenshot, but I can show it to you when we have a quick sync or debug call.

gbertulf avatar Apr 12 '24 05:04 gbertulf
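
For context, this is consistent with the load precision driving host memory pressure: eight FP32 copies of the weights are loaded into CPU memory at once, while FP16 or sym_int4 copies are much smaller. A minimal sketch of a lower-precision load, assuming the ipex_llm.transformers wrapper and the load_in_low_bit option that the example's --low-bit flag corresponds to (the exact flow inside deepspeed_autotp_mod.py may differ):

from ipex_llm.transformers import AutoModelForCausalLM

# Load the weights in 4-bit (sym_int4) instead of FP32 so that eight ranks
# loading simultaneously stay within host memory.
model = AutoModelForCausalLM.from_pretrained(
    "/home/gta/glen/Llama-2-7b-chat-hf",  # path from the mpirun command below
    load_in_low_bit="sym_int4",
    trust_remote_code=True,
)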

Hi Yang, from our debug sync you indicated that, on the same machine, a fellow team member was not seeing issues with the 8-GPU config. May I kindly ask for the BKM or steps taken?

On my end, using the reference script from the repo, I only modified the number of GPUs to 8. Were there any other intermediate steps needed?

export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so:${LD_PRELOAD}
basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=8 # number of used GPU
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # Different from PVC

mpirun -np $NUM_GPUS --prepend-rank \
  python deepspeed_autotp_mod.py --repo-id-or-model-path '/home/gta/glen/Llama-2-7b-chat-hf' --low-bit 'sym_int4'

gbertulf avatar Apr 17 '24 17:04 gbertulf

@Uxito-Ada are there any other intermediate steps needed?

yangw1234 avatar Apr 17 '24 18:04 yangw1234


Hi @yangw1234, no further steps are needed.

I think it could be because of OOM: in the log, the program stops after print(model) and before tokenizer.from_pretrained, so loading the tokenizer may take a lot of memory.

Uxito-Ada avatar Apr 18 '24 02:04 Uxito-Ada
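
To confirm or rule out this hypothesis, a minimal sketch (an assumption about where to instrument, not code from deepspeed_autotp_mod.py; assumes psutil is installed and Intel MPI's PMI_RANK variable is set) that brackets the tokenizer load with a per-rank RSS reading:

import os
import psutil
from transformers import AutoTokenizer

rank = int(os.environ.get("PMI_RANK", "0"))  # rank variable set by Intel MPI (assumption)
proc = psutil.Process()

print(f"[rank {rank}] RSS before tokenizer: {proc.memory_info().rss / 1e9:.2f} GB", flush=True)
tokenizer = AutoTokenizer.from_pretrained("/home/gta/glen/Llama-2-7b-chat-hf")
print(f"[rank {rank}] RSS after tokenizer:  {proc.memory_info().rss / 1e9:.2f} GB", flush=True)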

Issue is resolved. Closing this ticket. Thank you team for your help.

gbertulf avatar May 10 '24 16:05 gbertulf