ipex-llm

intelanalytics/multi-arc-serving:0.8.3-b21 on Core i7 13900 * A770: stress test fails and leaves the system dead after several rounds

Open wluo7 opened this issue 5 months ago • 1 comments

Describe the bug

Running intelanalytics/multi-arc-serving:0.8.3-b21 on 2*A770 with a Core i7 13900 (batch size 20, 1024 input tokens, 512 output tokens, model QwQ-32B-AWQ), the stress test fails after several rounds. The SSH connection to the server drops, and the server needs a physical restart to recover.

How to reproduce

Steps to reproduce the error:

b21_test.zip: start-vllm-service.sh and stress_test_b21.sh are in the attached zip file. Put them in the host directory that is mounted into the Docker container.

export DOCKER_IMAGE=intelanalytics/multi-arc-serving:0.8.3-b21
export CONTAINER_NAME=b21-test

sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
-v /home/intel/models/:/llm/models/ \
--name=$CONTAINER_NAME \
--shm-size="32g" \
--entrypoint /bin/bash \
$DOCKER_IMAGE

docker exec -it b21-test /bin/bash

cd models/

./start-vllm-service.sh

open another terminal:

docker exec -it b21-test /bin/bash

cd models/

./stress_test_b21.sh
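Because the machine hangs hard and SSH drops, it helps to know which round was in flight when it died. The following is a minimal sketch, not part of the attached zip: `run_rounds` is a hypothetical helper, and it assumes the stress script can simply be invoked once per round. It writes a log line before and after each round and calls `sync` so the entries survive a physical restart.

```shell
# Hypothetical helper (not from b21_test.zip): run a command N times and
# keep a per-round log that is flushed to disk before each round starts.
run_rounds() {
  local rounds="$1" log="$2"
  shift 2
  local i=1
  while [ "$i" -le "$rounds" ]; do
    printf '%s round %d start\n' "$(date -u +%FT%TZ)" "$i" >>"$log"
    sync  # flush so the "start" line survives a hard hang
    if ! "$@"; then
      printf '%s round %d FAILED\n' "$(date -u +%FT%TZ)" "$i" >>"$log"
      sync
      return 1
    fi
    printf '%s round %d ok\n' "$(date -u +%FT%TZ)" "$i" >>"$log"
    i=$((i + 1))
  done
}
```

For example, `run_rounds 10 /llm/models/rounds.log ./stress_test_b21.sh` would leave a trailing "start" line with no matching "ok" for the round that took the system down. Writing the log under the mounted path keeps it on the host after the reboot.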

Screenshots

Image

Environment information

PYTHON_VERSION=3.10.12

transformers=4.52.4

torch=2.7.1+cu126

ipex-llm ----------------------------------------------------------------- IPEX is not installed.

CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-13700
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
CPU max MHz: 5200.0000
CPU min MHz: 800.0000
BogoMIPS: 4224.00

Total CPU Memory: 62.5639 GB
Memory Type: DDR5

Operating System: Ubuntu 22.04.5 LTS


Linux intel-Default-string 6.8.0-65-generic #68~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 15 18:06:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

CLI:
Version: 1.2.27.20240103
Build ID: 5eeb3f13

Service:
Version: 1.2.27.20240103
Build ID: 5eeb3f13
Level Zero Version: 1.14.0

Driver UUID 32342e35-322e-3332-3232-342e35000000
Driver Version 24.52.32224.5
Driver UUID 32342e35-322e-3332-3232-342e35000000
Driver Version 24.52.32224.5
Driver UUID 32342e35-322e-3332-3232-342e35000000
Driver Version 24.52.32224.5

Driver related package version:
ii intel-fw-gpu 2025.13.2-398~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-i915-dkms 1.23.10.92.231129.101+i141-1 all Out of tree i915 driver.
ii intel-level-zero-gpu 1.6.32224.5 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-level-zero-gpu-dbgsym 1.6.32224.5 amd64 debug symbols for intel-level-zero-gpu
ii level-zero-dev 1.14.0-803.123~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

igpu not detected

xpu-smi is properly installed.

+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) UHD Graphics 770                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-0004a7808086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0003-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 2         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0007-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:07:00.0                                                        |
|           | DRM Device: /dev/dri/card2                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+

GPU0 Memory size=16M
GPU1 Memory size=16G
GPU2 Memory size=16G

00:02.0 VGA compatible controller: Intel Corporation Device a780 (rev 04) (prog-if 00 [VGA controller])
    DeviceName: Onboard - Video
    Subsystem: Intel Corporation Device 2212
    Flags: bus master, fast devsel, latency 0, IRQ 168, IOMMU group 0
    Memory at 6c01000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 4000000000 (64-bit, prefetchable) [size=256M]
    I/O ports at 4000 [size=64]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>

03:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Device 1ef7:1334
    Flags: bus master, fast devsel, latency 0, IRQ 169, IOMMU group 15
    Memory at 83000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 6800000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at 84000000 [disabled] [size=2M]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express Endpoint, MSI 00
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+

07:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
    Subsystem: Device 1ef7:1334
    Flags: bus master, fast devsel, latency 0, IRQ 173, IOMMU group 20
    Memory at 81000000 (64-bit, non-prefetchable) [size=16M]
    Memory at 6000000000 (64-bit, prefetchable) [size=16G]
    Expansion ROM at 82000000 [disabled] [size=2M]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express Endpoint, MSI 00
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+

Additional context Add any other context about the problem here.

wluo7 avatar Aug 04 '25 03:08 wluo7

I can't reproduce this in our environment: a desktop with an Intel(R) Core(TM) i9-14900K CPU and kernel 6.5.0-28-generic, with no UHD Graphics device on it. It looks like an issue caused by the environment. You could follow this guide to set up the environment and test again.
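The two differences named here (kernel version and the presence of an integrated UHD Graphics device) can be checked quickly on both machines. This is a minimal sketch, assuming `lspci -n` prints lines in its usual `BDF class: vendor:device` form; `list_intel_vga` is a hypothetical helper name, and it only filters text on stdin.

```shell
# Hypothetical helper: read `lspci -n` style lines on stdin and print
# "BDF vendor:device" for VGA-class (0300) devices from Intel (8086).
list_intel_vga() {
  awk '$2 == "0300:" && $3 ~ /^8086:/ { print $1, $3 }'
}
```

Running `uname -r` and `lspci -n | list_intel_vga` on each host gives a short fingerprint to compare. On the reporter's machine the lspci output above suggests three entries (device a780 for the iGPU and two 56a0 entries for the A770 cards), while a host without the iGPU would list only the Arc cards.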

hzjane avatar Aug 05 '25 01:08 hzjane