GPU is not used after model is loaded
Prerequisites
Before submitting your issue, please ensure the following:
- [x] I am running the latest version of PowerInfer. Development is rapid, and as of now, there are no tagged versions.
- [ ] I have carefully read and followed the instructions in the README.md.
- [ ] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
Expected Behavior
After the model is loaded, the GPU should be used during token generation, with FFN weights offloaded to VRAM within the given --vram-budget.
Current Behavior
Output of nvidia-smi after the model is loaded:
Wed Jan  3 07:23:06 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P0              24W /  80W |   5228MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     146126      C   ./build/bin/main                          5222MiB |
+---------------------------------------------------------------------------------------+
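A single nvidia-smi snapshot can miss brief bursts of activity. To confirm that the GPU really stays idle throughout generation, utilization can be sampled over time while main runs; a minimal sketch using the pynvml bindings (an assumption on my side: installed via pip install nvidia-ml-py, not part of PowerInfer):

import time
import pynvml

# Sample GPU utilization and memory once per second for ~30 s while
# ./build/bin/main is generating tokens.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
try:
    for _ in range(30):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU util: {util.gpu:3d}%  VRAM used: {mem.used / 2**20:6.0f} MiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()

In this case the snapshot above already shows 0% utilization with ~5 GiB resident.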
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
CPU family: 6
Model: 141
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 89%
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 4609.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 384 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 10 MiB (8 instances)
L3: 24 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Vulnerable: No microcode
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
- Operating System, e.g. for Linux: Arch Linux
$ uname -a
Linux ***** 6.1.70-1-lts #1 SMP PREEMPT_DYNAMIC Mon, 01 Jan 2024 13:44:01 +0000 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.6
$ make --version
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version
g++ (GCC) 13.2.1 20230801
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
- Build PowerInfer with CUDA support as described in the README.
- Run ./build/bin/main -m ../llama-13b-relu.powerinfer.gguf -n 128 -t 8 --vram-budget 5 -p "Once upon a time" (full command and log under "Failure Logs" below).
- While tokens are being generated, check nvidia-smi: about 5 GiB of VRAM is allocated to the process, but GPU-Util stays at 0%.
Failure Logs
$ git log | head -1
commit 74c5c5895b9acda1fc2224bb3ac87a9767d451f6
$ pip list | egrep "torch|numpy|sentencepiece"
egrep: warning: egrep is obsolescent; using grep -E
numpy 1.26.2
sentencepiece 0.1.99
torch 2.1.2
$ md5sum llama-13b-relu.powerinfer.gguf
d8daf12964ce178e9f9cef6eaf3c7be1 llama-13b-relu.powerinfer.gguf
Command used:
./build/bin/main -m ../llama-13b-relu.powerinfer.gguf \
  -n 128 -t 8 --vram-budget 5 -p "Once upon a time"
Bottom part of the log:
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
llama_build_graph: non-view tensors processed: 684/1044
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 8.25 MB
llama_new_context_with_model: VRAM scratch buffer: 6.69 MB
llama_new_context_with_model: total VRAM used: 5107.20 MB (model: 5100.51 MB, context: 6.69 MB)
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 32, n_predict = 128, n_keep = 0
Once upon a time, the world had only one planet. Humans lived on this
Same problem here.
Same here. The model loads quickly, but inference relies on the CPU and is slow.
In this scenario, the GPU is indeed used for token generation, but the performance bottleneck lies primarily with the CPU. This imbalance causes the GPU to wait frequently for the CPU's computation results, leading to low GPU utilization.
To get the best performance out of PowerInfer, we generally recommend using models that are 2-3x larger than the available VRAM. In such configurations, most of the densely activated tensors can be offloaded to the GPU, while the CPU processes only the sparsely activated ones, giving a more balanced workload distribution between the two.
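Conceptually, the loader fills the VRAM budget with the most densely activated FFN tensors first and leaves the rest on the CPU. A rough sketch of that placement policy (illustrative only, not PowerInfer's actual loader; the tensor names, sizes, and activation densities below are hypothetical):

# Illustrative sketch of a VRAM-budget offload policy -- not PowerInfer's code.
# Each FFN tensor carries an activation density ("hotness"); the hottest
# tensors go to the GPU until the budget runs out, the rest stay on the CPU.
def split_ffn_tensors(tensors, vram_budget_bytes):
    # tensors: iterable of (name, size_bytes, activation_density) triples
    gpu, cpu, remaining = [], [], vram_budget_bytes
    for name, size, density in sorted(tensors, key=lambda t: t[2], reverse=True):
        if size <= remaining:
            gpu.append(name)
            remaining -= size
        else:
            cpu.append(name)
    return gpu, cpu

# Hypothetical example with a 5 GiB budget (--vram-budget 5):
gpu_set, cpu_set = split_ffn_tensors(
    [("blk.0.ffn_up", 100 << 20, 0.9), ("blk.0.ffn_down", 100 << 20, 0.1)],
    vram_budget_bytes=5 << 30,
)

When the model is 2-3x the VRAM size, the budget ends up holding exactly the tensors that fire on almost every token, so the GPU does most of the per-token work while the CPU handles only the rarely activated remainder.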
I'm using a T4 GPU. Same as above: only 0.1 GB of GPU RAM is used.
GPU RAM 0.1 / 15.0 GB