ipex-llm
Flex 170 x8 is failing when targeting 6 or 8 GPUs
Observed this issue when attempting to run Llama2-7B 32x32 token inference on the Flex 170 x8 DUT. For reference, this DUT is accessible by following the instructions here: Welcome to the ISE Lab - ISE Team Lab Operations - Intel Enterprise Wiki.
Current system config and specs are listed here: https://wiki.ith.intel.com/display/MediaWiki/Flex-170x8+%28Inspur+-+ICX%29+Qualification
Script used: a modified version of run_vicuna_33b_arc_2_card.sh (from https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/README.md).
The failure is seen in this portion (some overlap exists in the screenshots):
Hi @gbertulf, you cannot run AutoTP if the attention head number is not divisible by the number of GPUs, as explained in the error message. In this case, the attention head number is 32 and the number of GPUs is 6.
Hi Yang, is there a workaround to get 6 or 8 GPUs working?
The attention head number is model dependent; for 6 GPUs, you could try to find a model whose head number is a multiple of 6.
For 8 GPUs, head number 32 should work.
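For reference, a minimal sketch (assuming a local copy of the checkpoint and the transformers AutoConfig API) of how the divisibility constraint can be checked up front, before launching AutoTP:

```python
# Minimal sketch: check whether the attention head count splits evenly
# across the intended number of GPUs before launching DeepSpeed AutoTP.
from transformers import AutoConfig

model_path = "/home/gta/glen/Llama-2-7b-chat-hf"  # local checkpoint used in this run
num_gpus = 6  # change to 8 to check the other configuration

config = AutoConfig.from_pretrained(model_path)
heads = config.num_attention_heads  # 32 for Llama2-7B

if heads % num_gpus != 0:
    print(f"{heads} attention heads cannot be split evenly across {num_gpus} GPUs")
else:
    print(f"OK: each of the {num_gpus} ranks gets {heads // num_gpus} heads")
```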
Hi Yang, going back to 8 GPUs on Flex with the 32 attention head count: I reran on the same platform and verified this with print(model) -- 32 attention heads.
However, even though 32 attention heads is evenly divisible by 8 GPUs, the execution still failed--
Wondering if there are any intermediate steps to take or changes in config or settings to apply?
Hi @gbertulf , would you mind providing the full output log in text format?
Hi Yang, please see the attached output text file: 8GPUs_llama2_7B.txt
@gbertulf if you are loading the model in FP32, it could be that all 8 model replicas are being loaded into CPU memory at the same time and there is not enough CPU memory. Can you check the total CPU memory size on your system and monitor the CPU memory usage while running the application?
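For reference, a minimal monitoring sketch, assuming psutil is available in the environment; the interval and duration values are arbitrary placeholders:

```python
# Rough sketch: log total/used/available system memory while the 8 ranks
# load the model, to see whether CPU RAM is exhausted during loading.
import time
import psutil

def log_cpu_memory(interval_s: float = 2.0, duration_s: float = 300.0) -> None:
    """Print system memory stats every `interval_s` seconds for `duration_s` seconds."""
    end = time.time() + duration_s
    while time.time() < end:
        mem = psutil.virtual_memory()
        print(f"total={mem.total / 1e9:.1f} GB "
              f"used={mem.used / 1e9:.1f} GB "
              f"available={mem.available / 1e9:.1f} GB "
              f"({mem.percent:.0f}% used)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_cpu_memory()
```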
Hi Yang, updating.
The CPU memory does hit max utilization when running 8 GPUs with the Vicuna 33B model on the DUT--
Please see the full Vicuna 33B text output log attached: 8GPUs_vicuna33B.txt
When running 8 GPUs with Llama2-7B + FP16, it does not hit max memory utilization. It is challenging to capture a screenshot, but I can show it to you when we have a quick sync or debug call.
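For reference, a hedged sketch of FP16 loading using the stock transformers API; the actual deepspeed_autotp_mod.py script may wrap loading differently, so this only illustrates the dtype and low_cpu_mem_usage choices:

```python
import torch
from transformers import AutoModelForCausalLM

model_path = "/home/gta/glen/Llama-2-7b-chat-hf"

# FP16 weights roughly halve the host-RAM footprint of FP32 while all 8
# ranks hold a CPU copy; low_cpu_mem_usage avoids a second full in-RAM copy.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
print(model.config.num_attention_heads)  # 32 for Llama2-7B
```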
Hi Yang, from our debug sync you indicated that, on the same machine, your fellow team members were not seeing issues with the 8-GPU config. May I kindly ask for the BKM or steps taken?
On my end, using the reference script from the repo, I only modified the number of GPUs to 8. Were there any other intermediate steps needed?
export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so:${LD_PRELOAD}
basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force
NUM_GPUS=8 # number of used GPUs
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # Different from PVC
mpirun -np $NUM_GPUS --prepend-rank \
  python deepspeed_autotp_mod.py --repo-id-or-model-path '/home/gta/glen/Llama-2-7b-chat-hf' --low-bit 'sym_int4'
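For reference, a quick sanity-check sketch (assuming intel_extension_for_pytorch is installed in the same environment, as it is in the ipex-llm GPU setup) to confirm all 8 Flex 170 devices are visible before launching mpirun:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the XPU backend

# Enumerate the XPU devices PyTorch can see; 8 are expected on this DUT.
count = torch.xpu.device_count()
print(f"visible XPU devices: {count}")
for i in range(count):
    print(f"  [{i}] {torch.xpu.get_device_name(i)}")
```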
@Uxito-Ada are there any other intermediate steps needed?
Hi @yangw1234, no further step is needed.
I think it could be because of OOM: in the log, the program stops after print(model) and before tokenizer.from_pretrained, so loading the tokenizer may be taking a lot of memory.
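For reference, one hedged way to test this hypothesis is to measure process RSS right before and after the tokenizer load; psutil is assumed to be available:

```python
# Quick check: how much resident memory does the tokenizer load (the step
# right after print(model) in the script) actually add?
import os
import psutil
from transformers import AutoTokenizer

model_path = "/home/gta/glen/Llama-2-7b-chat-hf"
proc = psutil.Process(os.getpid())

rss_before = proc.memory_info().rss
tokenizer = AutoTokenizer.from_pretrained(model_path)
rss_after = proc.memory_info().rss

print(f"tokenizer load added {(rss_after - rss_before) / 1e6:.1f} MB RSS")
```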
Issue is resolved. Closing this ticket. Thank you team for your help.