LightCompress
OOM when quantizing Qwen2.5-32B
What could cause an OOM at the replace-model stage? The model is a Qwen 32B, running on 7x A6000. nvidia-smi shows that all 7 GPUs were started, but the model seems to be placed only on gpu0.
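(A quick way to confirm that the weights are sitting only on gpu0 is to print free/total memory for every visible GPU with plain PyTorch; a minimal sketch, nothing LightCompress-specific:)

# Minimal check (plain PyTorch, not LightCompress-specific): print how much
# memory is in use on every visible GPU to confirm the model only occupies gpu0.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    used_gib = (total - free) / 2**30
    print(f"cuda:{i}: {used_gib:.1f} GiB used / {total / 2**30:.1f} GiB total")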
Config file:
base:
    seed: &seed 42
model:
    type: Qwen2
    path: ./DeepSeek-R1-Distill-Qwen-32B
    tokenizer_mode: slow
    torch_dtype: auto
# calib:
#     name: pileval
#     download: True
#     path: ./LLMCompress/data
#     n_samples: 128
#     bs: -1
#     seq_len: 512
#     preproc: general
#     seed: *seed
eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: True
    path: ./LLMCompress/data
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 1
    inference_per_block: False
quant:
    method: RTN
    weight:
        bit: 8
        symmetric: True
        granularity: per_channel
        group_size: -1
    act:
        bit: 8
        symmetric: True
        granularity: per_token
save:
    # save_fake: True
    save_vllm: True
    save_path: ./vllm_w8a8/
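(For reference, RTN with weight settings bit 8 / symmetric / per_channel just rounds each output channel of a weight matrix to the nearest int8 level with its own scale; a minimal illustrative sketch of that idea, not LLMC's implementation:)

# Illustrative sketch of symmetric per-channel INT8 round-to-nearest (RTN),
# matching weight: {bit: 8, symmetric: True, granularity: per_channel} above.
# Not LLMC's code, just the idea.
import torch

def rtn_quant_per_channel(w: torch.Tensor, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output channel
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

w = torch.randn(4096, 4096)
q, scale = rtn_quant_per_channel(w)
w_fake = q.float() * scale          # fake-quant reconstruction, as used at eval_pos: [fake_quant]
print((w - w_fake).abs().max())     # per-channel rounding error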
Run script:
#!/bin/bash
# export CUDA_VISIBLE_DEVICES=0,1,2,3

llmc=./LLMCompress/llmc
export PYTHONPATH=$llmc:$PYTHONPATH

task_name=quant
config=quant.yml

nnodes=1
nproc_per_node=7

find_unused_port() {
    while true; do
        port=$(shuf -i 10000-60000 -n 1)
        if ! ss -tuln | grep -q ":$port "; then
            echo "$port"
            return 0
        fi
    done
}
UNUSED_PORT=$(find_unused_port)

MASTER_ADDR=127.0.0.1
MASTER_PORT=$UNUSED_PORT
task_id=$UNUSED_PORT

nohup \
torchrun \
    --nnodes $nnodes \
    --nproc_per_node $nproc_per_node \
    --rdzv_id $task_id \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    ${llmc}/llmc/__main__.py --config $config --task_id $task_id \
    > ${task_name}.log 2>&1 &

sleep 2
ps aux | grep '__main__.py' | grep $task_id | awk '{print $2}' > ${task_name}.pid

# You can kill this program by
# xargs kill -9 < xxx.pid
# xxx.pid is ${task_name}.pid file
Log:
2025-02-27 00:07:03.578 | INFO | llmc.compression.quantization.base_blockwise_quantization:deploy:1007 - -- deploy_fake_quant_model done --
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py", line 248, in <module>
[rank0]: main(config)
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py", line 85, in main
[rank0]: eval_model(model, blockwise_opts, eval_list, eval_pos='fake_quant')
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/eval/utils.py", line 87, in eval_model
[rank0]: res = eval_class.eval(model)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/liangbohan/Projects/LLMCompress/llmc/llmc/eval/eval_base.py", line 197, in eval
[rank0]: model_llmc.model.cuda()
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3117, in cuda
[rank0]: return super().cuda(*args, **kwargs)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1050, in cuda
[rank0]: return self._apply(lambda t: t.cuda(device))
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 988, in _apply
[rank0]: self._buffers[key] = fn(buf)
[rank0]: File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1050, in <lambda>
[rank0]: return self._apply(lambda t: t.cuda(device))
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 99.00 MiB is free. Including non-PyTorch memory, this process has 47.42 GiB memory in use. Of the allocated memory 46.98 GiB is allocated by PyTorch, and 308.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W227 00:07:12.475262648 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0227 00:07:14.369000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273124 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273125 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273126 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273127 closing signal SIGTERM
W0227 00:07:14.370000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273128 closing signal SIGTERM
W0227 00:07:14.371000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 273129 closing signal SIGTERM
E0227 00:07:14.635000 273012 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 273123) of binary: /home/liangbohan/anaconda3/envs/llm_compress/bin/python
Traceback (most recent call last):
File "/home/liangbohan/anaconda3/envs/llm_compress/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/liangbohan/anaconda3/envs/llm_compress/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/liangbohan/Projects/LLMCompress/llmc/llmc/__main__.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-27_00:07:14
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 273123)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Try changing the eval section to this:

eval:
    eval_pos: [fake_quant]
    name: wikitext2
    download: True
    path: ./LLMCompress/data
    seq_len: 2048
    # For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
    # For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
    bs: 20
    inference_per_block: True
If it still OOMs, reduce bs further. This is because eval runs on a single GPU; a single GPU is enough when launching.
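For context, the idea behind inference_per_block (as I understand it) is to run evaluation block by block, so that only one decoder block needs to sit on the GPU at a time instead of the whole 32B model being moved onto gpu0 as model.cuda() does in the traceback above. A rough sketch of that pattern, illustrative only and not LLMC's actual eval code:

# Rough sketch of block-wise evaluation (illustrative; not LLMC's actual
# inference_per_block implementation). Only one decoder block lives on the
# GPU at a time, so peak memory stays near one block plus activations.
# Details such as attention masks / rotary embeddings are omitted.
import torch

@torch.no_grad()
def forward_per_block(blocks, hidden_states, device="cuda"):
    h = hidden_states.to(device)
    for block in blocks:              # e.g. model.model.layers
        block.to(device)              # load this block onto the GPU
        h = block(h)[0]               # decoder layers return a tuple
        block.to("cpu")               # free the GPU before the next block
        torch.cuda.empty_cache()
    return h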
Also, the commented-out line in the run script

# export CUDA_VISIBLE_DEVICES=0,1,2,3

needs to be uncommented. With 7 GPUs it should be:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
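Once CUDA_VISIBLE_DEVICES is exported before launch, PyTorch only sees the listed GPUs and renumbers them from 0, and each torchrun worker can bind to its own device via LOCAL_RANK. A quick check, plain PyTorch and illustrative only:

# Quick check: with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 exported before launch,
# PyTorch sees 7 devices renumbered cuda:0 .. cuda:6.
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))      # e.g. "0,1,2,3,4,5,6"
print(torch.cuda.device_count())                   # -> 7
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun for each worker
torch.cuda.set_device(local_rank)                  # each rank binds to its own GPU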