
[Bug]: Cannot load llama-3 gguf based models

Open EugeoSynthesisThirtyTwo opened this issue 1 year ago • 1 comment

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
Nvidia driver version: 546.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900HK
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 3
BogoMIPS: 5836.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 480 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 12.5 MiB (10 instances)
L3 cache: 24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

Upon entering the following command:

python -m aphrodite.endpoints.openai.api_server --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf

I get the following error:

INFO:     Extracting config from GGUF...
WARNING:  gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = 'Llama-3-8B-Instruct-abliterated-v2_q8.gguf'
INFO:     Speculative Config = None
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = gguf
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO:     Converting tokenizer from GGUF...
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
    run_server(args)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
    engine = cls(engine_config.parallel_config.worker_use_ray,
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 125, in __init__
    self._init_tokenizer()
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 246, in _init_tokenizer
    self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 136, in get_tokenizer
    return convert_gguf_to_tokenizer(tokenizer_name)
  File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 44, in convert_gguf_to_tokenizer
    scores = result.fields['tokenizer.ggml.scores']
KeyError: 'tokenizer.ggml.scores'

I get the same error for every Llama-3-based model, whether it's 8B or 70B.
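To see what the converter is tripping over, the GGUF metadata can be inspected directly. A minimal sketch, assuming the gguf package (pip install gguf) is installed and the file path matches the command above:

# Minimal sketch: list the tokenizer metadata keys stored in the GGUF file
# to confirm that 'tokenizer.ggml.scores' is absent for Llama-3 models.
from gguf import GGUFReader

reader = GGUFReader("Llama-3-8B-Instruct-abliterated-v2_q8.gguf")

# reader.fields maps each metadata key to a ReaderField object.
for key in reader.fields:
    if key.startswith("tokenizer."):
        print(key)

# Prints False on Llama-3 GGUFs; this missing key is exactly the lookup
# that raises the KeyError in convert_gguf_to_tokenizer above.
print("tokenizer.ggml.scores" in reader.fields)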

EugeoSynthesisThirtyTwo · May 18 '24 15:05

Llama-3 doesn't use LlamaTokenizer, so its GGUF files carry no tokenizer.ggml.scores field (scores exist only for SentencePiece-style tokenizers, not for Llama-3's BPE tokenizer); you need to supply the original tokenizer with --tokenizer original_repo.
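For example (the tokenizer repo below is illustrative; point it at whichever Hugging Face repository the GGUF was converted from):

python -m aphrodite.endpoints.openai.api_server \
    --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf \
    --tokenizer meta-llama/Meta-Llama-3-8B-Instruct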

sgsdxzy · May 18 '24 16:05