
[User] 65B models on CUBLAS/cuda bugged when prompts approach model's max context size

ycros opened this issue 2 years ago · 2 comments

Behavior

65B models running with cuBLAS, fully offloaded to the GPU, break as prompts approach the model's max context size (2048 for the models I've tested): they start spitting out garbage. The exact same prompts seem to work on smaller models, and I haven't noticed any problems with smaller models in regular use.

This started happening when k/v cache offloading to the GPU was implemented, and passing -lv to disable that offloading also fixes the issue. Offloading exactly n_layers works, and anything up to n_layers + 2 also works.

I also tested various quantisation levels from Q4_1 to Q8_0, and they make no difference. I've included a repro and some of my testing below.

Environment and Context

Tested on commit aacdbd40562684665b6f7b8ba6695b7a2088bbb0

  • Physical (or virtual) hardware:
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          26
On-line CPU(s) list:             0-25
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       26
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8480+
Stepping:                        8
CPU MHz:                         2000.000
BogoMIPS:                        4000.00
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       832 KiB
L1i cache:                       832 KiB
L2 cache:                        104 MiB
L3 cache:                        416 MiB
NUMA node0 CPU(s):               0-25
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx
                                 fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl
                                 xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1
                                 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor
                                 lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp
                                 ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1
                                 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt
                                 clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni
                                 avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes
                                 vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect
                                 cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16
                                 arch_capabilities

RAM: 196 GB
GPU: NVIDIA H100 (80 GB VRAM)

  • Operating System:

Linux 209-20-157-110 5.15.0-75-generic #82~20.04.1-Ubuntu SMP Wed Jun 7 19:37:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • SDK versions:
$ python3 --version
Python 3.8.10

$ make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Steps to Reproduce

cuda_kv_65b_fail.txt cuda_kv_65b_pass.txt

After some trial and error, I've come up with the above two prompt files; the difference between them is 2 characters (a space and a letter) at the end of the initial instruction paragraph. I've used -c 2048 -t 4 -ngl 100 -n 30 for these tests (see the example invocation after the list below).

Also, I realise I'm not properly substituting AI_NAME/USER_NAME in these prompts, but at least for the models I've tested it didn't matter.

  1. Running the fail prompt with a 65B model and all layers offloaded to the GPU should fail to return a reasonable result.
  2. Running the pass prompt with the same settings should work.
  3. Running the fail prompt against a smaller 33B or 30B model should work.
  4. Running the fail prompt against the same 65B model with -lv, or with fewer offloaded layers than the max, should work.
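For reference, the failing case boils down to an invocation along these lines (a sketch only, assuming the prompt file is passed with -f and substituting one of the 65B model files listed below; LLaMA 65B has 80 layers, so -ngl 100 offloads everything):

$ ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q8_0.bin -f cuda_kv_65b_fail.txt -c 2048 -t 4 -ngl 100 -n 30

Adding -lv to the same command, or dropping -ngl to 80-82, should produce sensible output again, per the observations above.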

The models I have tested with are:

airoboros-33b-gpt4-1.2.ggmlv3.q8_0.bin

airoboros-65B-gpt4-1.2.ggmlv3.q4_1.bin
airoboros-65B-gpt4-1.2.ggmlv3.q4_K_M.bin
airoboros-65B-gpt4-1.2.ggmlv3.q6_K.bin
airoboros-65B-gpt4-1.2.ggmlv3.q8_0.bin

guanaco-65B.ggmlv3.q8_0.bin

Failure Logs

While I have additionally validated the above text prompts via the cli, I initially reproduced this against the example server rather than the cli, since I suspected state was being carried across generations, so my main testing was done with this nodejs script: https://gist.github.com/ycros/254ccb8a15403016cbb2feb31d8a5ff3

I used the script to truncate the start of the prompt until there was a pass, and then zeroed in on the prompt length at which failures start to happen.
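As a rough sketch of a single probe in that loop (not the actual gist script; this assumes the example server is started with the same -c/-ngl settings on its default port, and uses jq to JSON-escape the prompt file):

$ ./server -m airoboros-65B-gpt4-1.2.ggmlv3.q8_0.bin -c 2048 -ngl 100
$ curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d "$(jq -n --rawfile p cuda_kv_65b_fail.txt '{prompt: $p, n_predict: 30}')"

Trimming the front of the prompt file between probes and re-running the request is enough to home in on the length where the output turns to garbage.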

And here is a log of my testing results around this prompt length: https://gist.github.com/ycros/63a3e7a3787d5bfbb66bc7ead08b345e

ycros · Jun 20 '23 06:06

@ycros hey, this is possibly related, but any chance you could confirm how much RAM llama.cpp is using while you have a 65B fully offloaded to the GPU vs not using the GPU? #1866

tangles-0 · Jun 20 '23 07:06

I spent a while staring at the code, and I think I've figured it out: it's allocating too small a VRAM scratch buffer for 65B. Increasing the vram_scratch size solves the problem; see my commit here: https://github.com/ycros/koboldcpp/commit/021e5099c790cdf21a0f1844491e34cb988a94b9

I haven't logged a PR because I'm not sure how to appropriately calculate a reasonable value here.
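Roughly, the kind of change I mean looks like the sketch below (illustrative only; the names and constants are placeholders, not what I actually committed), the intuition being that the scratch size handed to the CUDA backend probably needs a model-size-dependent term on top of whatever per-batch amount it currently gets, otherwise 65B at full context overflows it.

// Sketch: size the VRAM scratch buffer with a model-size term as well as
// a per-batch term. Constants are placeholders, not tuned values.
#include <cstddef>

static const size_t MB = 1024 * 1024;

static size_t vram_scratch_size(int n_batch, int n_layer) {
    size_t size = (size_t) n_batch * MB;   // per-batch component
    // extra headroom that grows with model size, so a 65B model
    // (80 layers) gets a larger scratch buffer than 7B-33B models
    if (n_layer >= 80) {
        size += 512 * MB;
    } else if (n_layer >= 60) {
        size += 256 * MB;
    } else {
        size += 128 * MB;
    }
    return size;
}

The hard part is deriving that second term properly (per model type, and probably per context size) rather than hard-coding it like this.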

ycros · Jun 22 '23 08:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Apr 10 '24 01:04