LightCompress
OOM when quantizing Qwen2.5-32B
What could cause an OOM at the replace-model stage? The model is a Qwen 32B, running on 7x A6000. nvidia-smi shows that all 7 GPUs were started, but the model seems to be placed only on gpu0.
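(A quick way to confirm that the weights are sitting only on gpu0 is to print free/total memory for every visible GPU with plain PyTorch; a minimal sketch, nothing LightCompress-specific:)

# Minimal check (plain PyTorch, not LightCompress-specific): print how much
# memory is in use on every visible GPU to confirm the model only occupies gpu0.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    used_gib = (total - free) / 2**30
    print(f"cuda:{i}: {used_gib:.1f} GiB used / {total / 2**30:.1f} GiB total")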
Config file:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: ./DeepSeek-R1-Distill-Qwen-32B
    tokenizer_mode: slow
    torch_dtype: auto
# calib:
#     name: pileval
#     download: True
#     path: ./LLMCompress/data
#     n_samples: 128
#     bs: -1
#     seq_len: 512
#     preproc: general
#     seed: *seed
eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: True
    path: ./LLMCompress/data
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 1
    inference_per_block: False
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
        group_size: -1
    act:
        bit: 8
        symmetric: True
        granularity: per_token
save:
    # save_fake: True
    save_vllm: True
    save_path: ./vllm_w8a8/
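(For reference, RTN with weight settings bit 8 / symmetric / per_channel just rounds each output channel of a weight matrix to the nearest int8 level with its own scale; a minimal illustrative sketch of that idea, not LLMC's implementation:)

# Illustrative sketch of symmetric per-channel INT8 round-to-nearest (RTN),
# matching weight: {bit: 8, symmetric: True, granularity: per_channel} above.
# Not LLMC's code, just the idea.
import torch

def rtn_quant_per_channel(w: torch.Tensor, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output channel
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(4096, 4096)
q, scale = rtn_quant_per_channel(w)
w_fake = q.float() * scale          # fake-quant reconstruction, as used at eval_pos: [fake_quant]
print((w - w_fake).abs().max())     # per-channel rounding error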
Run script:
#!/bin/bash
# export CUDA_VISIBLE_DEVICES=0,1,2,3

llmc=./LLMCompress/llmc
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=quant
config=quant.yml

nnodes=1
nproc_per_node=7

find_unused_port() {
    while true; do
        port=$(shuf -i 10000-60000 -n 1)
        if ! ss -tuln | grep -q ":$port "; then
            echo "$port"
            return 0
        fi
    done
}
UNUSED_PORT=$(find_unused_port)

MASTER_ADDR=127.0.0.1
MASTER_PORT=$UNUSED_PORT
task_id=$UNUSED_PORT

nohup \
torchrun \
    --nnodes $nnodes \
    --nproc_per_node $nproc_per_node \
    --rdzv_id $task_id \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    ${llmc}/llmc/__main__.py --config $config --task_id $task_id \
    > ${task_name}.log 2>&1 &

sleep 2
ps aux | grep '__main__.py' | grep $task_id | awk '{print $2}' > ${task_name}.pid

# You can kill this program by
# xargs kill -9 < xxx.pid
# xxx.pid is ${task_name}.pid file
Log:
2025-02-27 00:07:03.578 | INFO | llmc.compression.quantization.base_blockwise_quantization:deploy:1007 - -- deploy_fake_quant_model done --
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py", line 248, in <module>
[rank0]: main(config)
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py", line 85, in main
[rank0]: eval_model(model, blockwise_opts, eval_list, eval_pos='fake_quant')
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/eval/utils.py", line 87, in eval_model
[rank0]: res = eval_class.eval(model)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/eval/eval_base.py", line 197, in eval
[rank0]: model_llmc.model.cuda()
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3117, in cuda
[rank0]: return super().cuda(*args, **kwargs)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1050, in cuda
[rank0]: return self._apply(lambda t: t.cuda(device))
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 988, in _apply
[rank0]: self._buffers[key] = fn(buf)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1050, in <lambda>
[rank0]: return self._apply(lambda t: t.cuda(device))
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 99.00 MiB is free. Including non-PyTorch memory, this process has 47.42 GiB memory in use. Of the allocated memory 46.98 GiB is allocated by PyTorch, and 308.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W227 00:07:12.475262648 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0227 00:07:14.369000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273124 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273125 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273126 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273127 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273128 closing signal SIGTERM
W0227 00:07:14.371000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273129 closing signal SIGTERM
E0227 00:07:14.635000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 273123) of binary: /home/liangbohan/anaconda3/envs/llm_compress/bin/python
Traceback (most recent call last):
File "/home/liangbohan/anaconda3/envs/llm_compress/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-27_00:07:14
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 273123)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Try changing the eval section to this:

eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: True
    path: ./LLMCompress/data
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 20
    inference_per_block: True
If it still OOMs, reduce bs further. This is because eval runs on a single GPU; a single GPU is enough when launching.
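For context, the idea behind inference_per_block (as I understand it) is to run evaluation block by block, so that only one decoder block needs to sit on the GPU at a time instead of the whole 32B model being moved onto gpu0 as model.cuda() does in the traceback above. A rough sketch of that pattern, illustrative only and not LLMC's actual eval code:

# Rough sketch of block-wise evaluation (illustrative; not LLMC's actual
# inference_per_block implementation). Only one decoder block lives on the
# GPU at a time, so peak memory stays near one block plus activations.
# Details such as attention masks / rotary embeddings are omitted.
import torch

@torch.no_grad()
def forward_per_block(blocks, hidden_states, device="cuda"):
    h = hidden_states.to(device)
    for block in blocks:              # e.g. model.model.layers
        block.to(device)              # load this block onto the GPU
        h = block(h)[0]               # decoder layers return a tuple
        block.to("cpu")               # free the GPU before the next block
        torch.cuda.empty_cache()
    return h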
Also, the commented-out line in the run script

# export CUDA_VISIBLE_DEVICES=0,1,2,3

needs to be uncommented. With 7 GPUs it should be:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
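Once CUDA_VISIBLE_DEVICES is exported before launch, PyTorch only sees the listed GPUs and renumbers them from 0, and each torchrun worker can bind to its own device via LOCAL_RANK. A quick check, plain PyTorch and illustrative only:

# Quick check: with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 exported before launch,
# PyTorch sees 7 devices renumbered cuda:0 .. cuda:6.
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))      # e.g. "0,1,2,3,4,5,6"
print(torch.cuda.device_count())                   # -> 7
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun for each worker
torch.cuda.set_device(local_rank)                  # each rank binds to its own GPU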