
[Max1100/bigdl-llm]Met OOM easily when running llama2-7b/Mistral-7B-v0.1 int4/fp8 multi-batch

Open · Yanli2190 opened this issue on Jan 23, 2024 · 3 comments

When running llama2-7b/Mistral-7B-v0.1 int4/fp8 multi-batch on a Max1100, we easily hit an OOM. It looks like, once multi-batch is enabled and the model is run for multiple iterations, GPU memory keeps increasing on every iteration.

HW: Max1100
OS: Ubuntu 22.04
SW: oneAPI 2024.0 / bigdl-llm 2.5.0b20240118 (based on torch 2.1)
GPU driver: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

How to reproduce:

  1. Create a conda env and install bigdl-llm via "pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu"
  2. Run the attached run.sh on the Max1100 (a minimal sketch of what it drives appears after this list) and monitor GPU memory via "sudo xpu-smi dump -m 0,1,2,3,4,5,18"
  3. GPU memory increases on every iteration, and after several iterations we hit OOM
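
For context, here is a minimal sketch of the kind of benchmark run.sh drives (this is not the attached script; the model path, prompt, and generation settings are placeholder assumptions matched to the "512 tokens x 8 batch" log below):

import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the "xpu" device)
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

MODEL_PATH = "meta-llama/Llama-2-7b-hf"  # placeholder

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_low_bit="sym_int4", trust_remote_code=True).to("xpu")
tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
tokenizer.pad_token = tokenizer.eos_token  # llama has no pad token by default

# 8 copies of one prompt -> batch of 8, padded to 512 input tokens
inputs = tokenizer(["Once upon a time"] * 8, return_tensors="pt",
                   padding="max_length", max_length=512).to("xpu")

for i in range(11):
    t0 = time.perf_counter()
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=512, do_sample=False)
    torch.xpu.synchronize()
    print(f"iter {i + 1}:  {time.perf_counter() - t0:.2f} sec total")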

— Yanli2190, Jan 23 '24

log.txt

— Yanli2190, Jan 23 '24

We failed to reproduce this problem on our machine (Max1100).

Environments:

bigdl-llm version: 2.5.0b20240118
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1100 OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.26918]

Here is the log:

 - INFO - intel_extension_for_pytorch auto imported
loading model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.34it/s]
 - INFO - Converting the current model to sym_int4 format......
LlamaAttention(
  (q_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
  (k_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
  (v_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
  (o_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
  (rotary_emb): LlamaRotaryEmbedding()
)
warming up for 10 iterations...
finished warmup
prefill (512 tokens x 8 batch) + generation (512 tokens x 8 batch):
    iter 1:  xx sec total
    iter 2:  xx sec total
    iter 3:  xx sec total
    iter 4:  xx sec total
    iter 5:  xx sec total
    iter 6:  xx sec total
    iter 7:  xx sec total
    iter 8:  xx sec total
    iter 9:  xx sec total
    iter 10:  xx sec total
    iter 11:  xx sec total
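
(Aside: the LlamaAttention dump above can be reproduced by printing one decoder layer after conversion; a one-liner, assuming `model` is loaded as in the sketch earlier in this thread:)

# Shows the first decoder layer's attention module; q/k/v/o_proj print
# as LowBitLinear once the sym_int4 conversion has been applied.
print(model.model.layers[0].self_attn)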

Here is the GPU mem stats:

Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
01:02:55.000,    0, 99.76, 196.87, 1550,  N/A,  N/A, 43.91, 21579.29
01:02:56.000,    0, 99.81, 196.76, 1550,  N/A,  N/A, 43.91, 21579.29
01:02:57.000,    0, 99.82, 197.18, 1550,  N/A,  N/A, 43.91, 21579.29
01:02:58.000,    0, 99.85, 197.55, 1550,  N/A,  N/A, 43.91, 21579.29
01:02:59.000,    0, 89.60, 184.65,    0,  N/A,  N/A, 43.91, 21579.29
01:03:00.000,    0, 0.00, 27.95,    0,  N/A,  N/A, 43.91, 21579.29
01:03:01.000,    0, 0.00, 27.88,    0,  N/A,  N/A, 43.91, 21579.29
01:03:02.000,    0, 0.00, 27.85,    0,  N/A,  N/A, 43.91, 21579.29
01:03:03.000,    0, 0.00, 27.78,    0,  N/A,  N/A, 43.91, 21579.29
01:03:04.000,    0, 9.05, 51.09, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:05.000,    0, 99.33, 209.28, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:06.000,    0, 99.67, 191.56, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:07.000,    0, 99.77, 192.01, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:08.000,    0, 99.82, 193.04, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:09.000,    0, 99.78, 192.70, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:10.000,    0, 99.82, 192.79, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:11.000,    0, 99.82, 192.93, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:12.000,    0, 99.82, 193.61, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:13.000,    0, 99.79, 193.89, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:14.000,    0, 99.69, 194.24, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:15.000,    0, 99.51, 194.23, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:16.000,    0, 99.55, 195.14, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:17.000,    0, 99.55, 195.87, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:18.000,    0, 99.54, 195.74, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:19.000,    0, 99.74, 196.17, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:20.000,    0, 99.71, 196.35, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:21.000,    0, 99.82, 197.02, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:22.000,    0, 99.82, 197.39, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:23.000,    0, 99.83, 197.36, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:24.000,    0, 99.85, 197.49, 1550,  N/A,  N/A, 43.91, 21579.28
01:03:25.000,    0, 40.15, 109.33,    0,  N/A,  N/A, 43.91, 21579.28
01:03:26.000,    0, 4.43, 46.86,    0,  N/A,  N/A, 43.91, 21579.28
01:03:27.000,    0, 0.00, 27.86,    0,  N/A,  N/A, 43.91, 21579.28
01:03:28.000,    0, 0.00, 27.75,    0,  N/A,  N/A, 43.91, 21579.28
01:03:29.000,    0, 0.00, 27.72,    0,  N/A,  N/A, 43.91, 21579.28
01:03:30.000,    0, 58.30, 191.29, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:31.000,    0, 99.38, 191.68, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:32.000,    0, 99.57, 191.20, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:33.000,    0, 99.68, 191.98, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:34.000,    0, 99.81, 192.40, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:35.000,    0, 99.82, 192.78, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:36.000,    0, 99.81, 193.62, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:37.000,    0, 99.82, 193.19, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:38.000,    0, 99.81, 193.47, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:39.000,    0, 99.80, 193.92, 1550,  N/A,  N/A, 46.94, 23066.18
01:03:40.000,    0, 98.81, 194.58, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:41.000,    0, 99.59, 195.55, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:42.000,    0, 99.60, 196.16, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:43.000,    0, 99.73, 196.77, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:44.000,    0, 99.75, 196.72, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:45.000,    0, 99.77, 196.74, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:46.000,    0, 99.80, 197.54, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:47.000,    0, 99.75, 197.94, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:48.000,    0, 99.84, 197.89, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:49.000,    0, 99.82, 197.96, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:50.000,    0, 99.83, 198.46, 1550,  N/A,  N/A, 43.91, 21579.29
01:03:51.000,    0, 1.77, 45.10,    0,  N/A,  N/A, 43.91, 21579.29
01:03:52.000,    0, 0.00, 27.82,    0,  N/A,  N/A, 43.91, 21579.29
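
Note that in this trace the memory usage oscillates between ~21,579 MiB and ~23,066 MiB and returns to the same baseline on every iteration, rather than growing monotonically. On the machine where the OOM occurs, logging allocator statistics per iteration can help tell a real leak from this benign prefill/decode oscillation. A hedged sketch, assuming IPEX's CUDA-style torch.xpu memory API; `run_one_iter` is a hypothetical placeholder for one prefill+decode pass:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (enables torch.xpu)

def log_memory_per_iter(run_one_iter, n_iters=11):
    # run_one_iter: caller-supplied callable performing one benchmark pass.
    for i in range(n_iters):
        run_one_iter()
        torch.xpu.synchronize()
        torch.xpu.empty_cache()  # release cached blocks so growth reflects live tensors
        alloc = torch.xpu.memory_allocated() / 1024**2
        reserved = torch.xpu.memory_reserved() / 1024**2
        print(f"iter {i + 1}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

If allocated memory climbs across iterations even after empty_cache(), something (e.g. past_key_values or other cached state) is being kept alive between iterations; if only the reserved figure grows, it is allocator caching/fragmentation rather than a Python-level leak.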

— Ricky-Ting, Jan 26 '24