
[Bug]: Error at Custom KoboldAI Endpoint! The custom endpoint failed to respond correctly. You may wish to try a different URL or API type.

Open · baditaflorin opened this issue 1 year ago · 1 comment

Your current environment

The output of `python env.py`

```text
python3 env.py
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-50-generic-x86_64-with-glibc2.39
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 550.120
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan-v2 Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 1
Stepping: 1
BogoMIPS: 5589.49
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 6 MiB (12 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-11            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I ran the Docker demo from the README, and when I access the URL http://localhost:2242/ I get the error: Error at Custom KoboldAI Endpoint! The custom endpoint failed to respond correctly. You may wish to try a different URL or API type.


docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "CUDA_VISIBLE_DEVICES=0" \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --api-keys "sk-empty" \
    --distributed-executor-backend=mp
INFO:     Multiprocessing frontend to use
ipc:///tmp/6613166f-863d-42db-98cc-5c78ae5f00a4 for RPC Path.
INFO:     Started engine process with PID 44
WARNING:  The model has a long context length (131072). This may cause OOM
errors during the initial memory profiling phase, or result in low performance
due to small KV cache space. Consider setting --max-model-len to a smaller
value.
INFO:
--------------------------------------------------------------------------------
-----
INFO:     Initializing Aphrodite Engine (v0.6.4.post1 commit 20f11fd0) with the
following config:
INFO:     Model = 'NousResearch/Meta-Llama-3.1-8B-Instruct'
INFO:     DataType = torch.bfloat16
INFO:     Tensor Parallel Size = 1
INFO:     Pipeline Parallel Size = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Context Length = 131072
INFO:     Enforce Eager Mode = False
INFO:     Prefix Caching = False
INFO:     Device = device(type='cuda')
INFO:     Guided Decoding Backend =
DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO:
--------------------------------------------------------------------------------
-----
WARNING:  Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary
CPU contention. Set OMP_NUM_THREADS in the external environment to tune this
value as needed.
INFO:     Loading model NousResearch/Meta-Llama-3.1-8B-Instruct...
INFO:     Using model weights format ['*.safetensors']
⠋ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 14.96/14.96 GiB 0:00:02
INFO:     Model weights loaded in 133.76 seconds.
INFO:     Total model weights memory usage: 14.99 GiB
INFO:     Profiling peak memory usage...
INFO:     Model profiling took 8.69 seconds.
INFO:     # GPU blocks: 2214, # CPU blocks: 2048
INFO:     Minimum concurrency: 0.27x
INFO:     Maximum sequence length allowed in the cache: 35424
ERROR:    The model's max seq len (131072) is larger than the maximum number of
tokens that can be stored in KV cache (35424). Try increasing
`gpu_memory_utilization`, setting `--enable-chunked-prefill`, or
`--kv-cache-dtype fp8` when initializing the engine. The last two are currently
mutually exclusive.
ERROR:    Forcing max_model_len to 35424.
INFO:     Capturing the model for CUDA graphs. This may lead to unexpected
consequences if the model is not static. To run the model in eager mode, set
'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO:     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are
running out of memory, consider decreasing `gpu_memory_utilization` or enforcing
eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory
usage.
INFO:     Graph capturing finished in 9.70 secs
INFO:     Aphrodite to use /tmp/tmp9qsgperp as PROMETHEUS_MULTIPROC_DIR
WARNING:  Admin key not provided. Admin operations will be disabled.
WARNING:  embedding_mode is False. Embedding API will not work.
INFO:     Kobold Lite UI:   http://localhost:2242/
INFO:     Documentation:    http://localhost:2242/redoc
INFO:     Completions API:  http://localhost:2242/v1/completions
INFO:     Chat API:         http://localhost:2242/v1/chat/completions
INFO:     Embeddings API:   http://localhost:2242/v1/embeddings
INFO:     Tokenization API: http://localhost:2242/v1/tokenize
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:2242 (Press CTRL+C to quit)
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
0.0%, CPU KV cache usage: 0.0%.
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0
tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage:
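
As the warning in the log suggests, passing `--max-model-len` explicitly avoids both the OOM risk during profiling and the forced override at startup. A sketch of the same command with that one flag added; the value 16384 is just an example, anything at or below the cache limit of 35424 reported above fits:

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "CUDA_VISIBLE_DEVICES=0" \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --api-keys "sk-empty" \
    --distributed-executor-backend=mp \
    --max-model-len 16384
```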

baditaflorin · Dec 12 '24 07:12

Try removing the `--api-keys` arg. Setting up the Kobold UI with an API key is more involved.
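
For reference, a sketch of the reporter's command with only the `--api-keys` flag dropped (everything else unchanged):

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "CUDA_VISIBLE_DEVICES=0" \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --distributed-executor-backend=mp
```

If you keep the key, the OpenAI-compatible endpoints should still respond as long as the key is sent as a bearer token; it is the built-in Kobold Lite UI that needs the extra setup. For example (assuming the `sk-empty` key from the original command):

```bash
curl http://localhost:2242/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-empty" \
    -d '{"model": "NousResearch/Meta-Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```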

AlpinDale · Dec 12 '24 14:12