GLM-130B: CPU memory is exhausted after quantizing to INT4

Open ionescofung opened this issue 2 years ago • 5 comments

After quantizing the model to INT4, I ran:

torchrun --nproc_per_node 4 /mnt/nvme/GLM-130B/generate.py --seed 1234 --mode inference --sampling-strategy BaseStrategy --out-seq-length 256 --min-gen-length 0 --num-beams 4 --length-penalty 1.0 --no-repeat-ngram-size 3 --temperature 1.0 --top_k 0 --top_p 0.7 --output-path samples --model-parallel-size 4 --num-layers 70 --hidden-size 12288 --inner-hidden-size 32768 --vocab-size 150528 --num-attention-heads 96 --max-sequence-length 2048 --tokenizer-type icetk-glm-130B --layernorm-order post --quantization-bit-width 4 --load /mnt/nvme/GLM-130B/checkp2/glm --skip-init --fp16 --input-source hello[gMASK]

My machine has 316 GB of RAM, but the process still runs out of CPU memory. The machine also has eight 3090 GPUs, yet GPU memory does not appear to be used at all. The OS is Ubuntu 20.04. Why does this happen?
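For a sense of scale, here is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at different bit widths. It is not a diagnosis. It assumes roughly 130 billion parameters (taken from the model name, not measured), and the helper `weight_gib` is a hypothetical name used only for illustration. It ignores quantization scales, activations, and any temporary copies made while a checkpoint is loaded.

```python
def weight_gib(num_params: float, bits: int) -> float:
    """Size in GiB of `num_params` weights stored at `bits` bits each."""
    return num_params * bits / 8 / 2**30


# Assumption: about 130e9 parameters, inferred from the name "GLM-130B".
NUM_PARAMS = 130e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gib(NUM_PARAMS, bits):.0f} GiB")
```

Comparing these figures with the 316 GB of host RAM may help narrow down whether FP16-sized or INT4-sized weights are being held in CPU memory, and by how many of the 4 processes at once.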

Feb 17 '23 07:02 ionescofung
