[Bug]: Cannot load Llama-3 GGUF-based models
Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU
Nvidia driver version: 546.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900HK
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 3
BogoMIPS: 5836.79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip gfni vaes vpclmulqdq rdpid fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 480 KiB (10 instances)
L1i cache: 320 KiB (10 instances)
L2 cache: 12.5 MiB (10 instances)
L3 cache: 24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
🐛 Describe the bug
Upon entering the following command:

python -m aphrodite.endpoints.openai.api_server --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf
I get the following error:
INFO: Extracting config from GGUF...
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = 'Llama-3-8B-Instruct-abliterated-v2_q8.gguf'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: Converting tokenizer from GGUF...
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
run_server(args)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
engine = cls(engine_config.parallel_config.worker_use_ray,
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
return engine_class(*args, **kwargs)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 125, in __init__
self._init_tokenizer()
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 246, in _init_tokenizer
self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
return TokenizerGroup(**init_kwargs)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 136, in get_tokenizer
return convert_gguf_to_tokenizer(tokenizer_name)
File "/home/eugeo/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 44, in convert_gguf_to_tokenizer
scores = result.fields['tokenizer.ggml.scores']
KeyError: 'tokenizer.ggml.scores'
I get the same error for every Llama-3-based model, whether it's 8B or 70B.
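Inspecting the file's metadata confirms the field the converter asks for is simply absent. A minimal sketch, assuming the `gguf` Python package (the same reader the traceback goes through) and the file name above:

```python
from gguf import GGUFReader

result = GGUFReader("Llama-3-8B-Instruct-abliterated-v2_q8.gguf")

# Llama-2-era GGUFs (SentencePiece tokenizer) carry per-token scores;
# Llama-3 GGUFs (BPE tokenizer) do not, hence the KeyError above.
print("tokenizer.ggml.scores" in result.fields)  # False for Llama 3

# The tokenizer family is recorded in the metadata itself:
field = result.fields["tokenizer.ggml.model"]
print(field.parts[field.data[0]].tobytes().decode())  # 'gpt2' for Llama 3, 'llama' for Llama 2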
Llama 3 doesn't use LlamaTokenizer; its tokenizer is BPE-based rather than SentencePiece-based, so the GGUF carries no `tokenizer.ggml.scores` field. You need to supply the original tokenizer with `--tokenizer original_repo`.
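For example (the `meta-llama/Meta-Llama-3-8B-Instruct` repo is an assumption here; substitute whichever repo your GGUF was actually converted from):

```bash
python -m aphrodite.endpoints.openai.api_server \
  --model Llama-3-8B-Instruct-abliterated-v2_q8.gguf \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct
```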