Tabby 0.13.1 / 0.14.0 does not start when following the quick-start guide
Describe the bug
Tabby 0.13.1 and 0.14.0 do not start when following the quick-start guide. Startup gets as far as launching the embedding-model process:
/opt/tabby/bin/llama-server -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf --cont-batching --port 30888 -np 1 --log-disable --ctx-size 4096 -ngl 9999 --embedding --ubatch-size 4096
and then hangs forever.
The container was started with:
docker run -it --name tabbyserver4 --restart=unless-stopped --gpus '"device=0"' -p 8082:8080 -v /data/tabby:/data tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda
Writing to new file.
🎯 Downloaded https://huggingface.co/TabbyML/models/resolve/main/starcoderbase-1B.Q8_0.gguf to /data/models/TabbyML/StarCoder-1B/ggml/model.gguf.tmp
00:03:02 ▕████████████████████▏ 1.23 GiB/1.23 GiB 6.88 MiB/s ETA 0s. ✅ Checksum OK.
Writing to new file.
🎯 Downloaded https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-q8_0.gguf to /data/models/TabbyML/Qwen2-1.5B-Instruct/ggml/model.gguf.tmp
00:03:37 ▕████████████████████▏ 1.53 GiB/1.53 GiB 7.22 MiB/s ETA 0s. ✅ Checksum OK.
⠋ 2173.060 s Starting...2024-07-24T07:25:27.218916Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
2024-07-24T07:25:27.218935Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf (version GGUF V3 (latest))
2024-07-24T07:25:27.218940Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-07-24T07:25:27.218943Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 0: general.architecture str = nomic-bert
2024-07-24T07:25:27.218946Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
2024-07-24T07:25:27.218950Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
2024-07-24T07:25:27.218953Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
2024-07-24T07:25:27.218960Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
2024-07-24T07:25:27.218962Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
2024-07-24T07:25:27.218964Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
2024-07-24T07:25:27.218965Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
2024-07-24T07:25:27.218968Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 8: general.file_type u32 = 7
2024-07-24T07:25:27.218971Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
2024-07-24T07:25:27.218974Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
2024-07-24T07:25:27.218982Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
2024-07-24T07:25:27.218983Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
2024-07-24T07:25:27.218985Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
2024-07-24T07:25:27.218986Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
2024-07-24T07:25:27.218988Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
2024-07-24T07:25:27.218991Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
2024-07-24T07:25:27.218992Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
2024-07-24T07:25:27.218994Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-07-24T07:25:27.218996Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
2024-07-24T07:25:27.218999Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
2024-07-24T07:25:27.219005Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
2024-07-24T07:25:27.219009Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv 22: general.quantization_version u32 = 2
2024-07-24T07:25:27.219012Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type f32: 51 tensors
2024-07-24T07:25:27.219016Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type q8_0: 61 tensors
2024-07-24T07:25:27.219021Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-07-24T07:25:27.219026Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-07-24T07:25:27.219031Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: format = GGUF V3 (latest)
2024-07-24T07:25:27.219036Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: arch = nomic-bert
2024-07-24T07:25:27.219041Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab type = WPM
2024-07-24T07:25:27.219047Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_vocab = 30522
2024-07-24T07:25:27.219051Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_merges = 0
2024-07-24T07:25:27.219056Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab_only = 0
2024-07-24T07:25:27.219064Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_train = 2048
2024-07-24T07:25:27.219071Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd = 768
2024-07-24T07:25:27.219078Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_layer = 12
2024-07-24T07:25:27.219084Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head = 12
2024-07-24T07:25:27.219091Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head_kv = 12
2024-07-24T07:25:27.219099Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_rot = 64
2024-07-24T07:25:27.219105Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_swa = 0
2024-07-24T07:25:27.219111Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_k = 64
2024-07-24T07:25:27.219118Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_v = 64
2024-07-24T07:25:27.219125Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_gqa = 1
2024-07-24T07:25:27.219133Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_k_gqa = 768
2024-07-24T07:25:27.219139Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_v_gqa = 768
2024-07-24T07:25:27.219143Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_eps = 1.0e-12
2024-07-24T07:25:27.219149Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_rms_eps = 0.0e+00
2024-07-24T07:25:27.219157Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-07-24T07:25:27.219176Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-07-24T07:25:27.219186Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_logit_scale = 0.0e+00
2024-07-24T07:25:27.219193Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ff = 3072
2024-07-24T07:25:27.219210Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert = 0
2024-07-24T07:25:27.219218Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert_used = 0
2024-07-24T07:25:27.219224Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: causal attn = 0
2024-07-24T07:25:27.219251Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: pooling type = 1
2024-07-24T07:25:27.219254Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope type = 2
2024-07-24T07:25:27.219257Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope scaling = linear
2024-07-24T07:25:27.219260Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_base_train = 1000.0
2024-07-24T07:25:27.219265Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-07-24T07:25:27.219270Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_orig_yarn = 2048
2024-07-24T07:25:27.219275Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope_finetuned = unknown
2024-07-24T07:25:27.219280Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_conv = 0
2024-07-24T07:25:27.219286Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_inner = 0
2024-07-24T07:25:27.219291Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_state = 0
2024-07-24T07:25:27.219298Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_dt_rank = 0
2024-07-24T07:25:27.219304Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model type = 137M
2024-07-24T07:25:27.219309Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model ftype = Q8_0
2024-07-24T07:25:27.219315Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model params = 136.73 M
2024-07-24T07:25:27.219329Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model size = 138.65 MiB (8.51 BPW)
2024-07-24T07:25:27.219334Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: general.name = nomic-embed-text-v1.5
2024-07-24T07:25:27.219343Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: BOS token = 101 '[CLS]'
2024-07-24T07:25:27.219347Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: EOS token = 102 '[SEP]'
2024-07-24T07:25:27.219352Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: UNK token = 100 '[UNK]'
2024-07-24T07:25:27.219362Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: SEP token = 102 '[SEP]'
2024-07-24T07:25:27.219365Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: PAD token = 0 '[PAD]'
2024-07-24T07:25:27.219371Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: CLS token = 101 '[CLS]'
2024-07-24T07:25:27.219374Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: MASK token = 103 '[MASK]'
2024-07-24T07:25:27.219376Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: LF token = 0 '[PAD]'
2024-07-24T07:25:27.219378Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: max token length = 21
2024-07-24T07:25:27.219381Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-07-24T07:25:27.219387Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-07-24T07:25:27.219390Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: found 1 CUDA devices:
2024-07-24T07:25:27.219392Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
⠸ 2174.102 s Starting...^C2024-07-24T07:25:28.289106Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
[... the restarted llama-server then dumps the identical model-loading log as above, verbatim apart from timestamps ...]
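Rerunning the same llama-server command by hand inside the container, without --log-disable and on a spare port, might surface whatever error the supervisor is swallowing. This is only a sketch, reusing the container name and flags from above; port 30889 is an arbitrary choice to avoid clashing with the supervisor's own restarts:

docker exec -it tabbyserver4 /opt/tabby/bin/llama-server -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf --cont-batching --port 30889 -np 1 --ctx-size 4096 -ngl 9999 --embedding --ubatch-size 4096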
Information about your version
0.13.1 and 0.14.0

Information about your GPU
Wed Jul 24 15:30:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 On | 00000000:01:00.0 Off | N/A |
| 50% 43C P8 19W / 320W | 29MiB / 20480MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1995 G /usr/libexec/Xorg 12MiB |
| 0 N/A N/A 3008 G gnome-shell 4MiB |
| 0 N/A N/A 3923 G /usr/libexec/gnome-initial-setup 3MiB |
+---------------------------------------------------------------------------------------+
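More detail might also be coaxed out of tabby itself with verbose Rust logging. This is an untested sketch; it assumes tabby honors the standard RUST_LOG filter used by tracing-subscriber:

docker run -it --gpus '"device=0"' -p 8082:8080 -v /data/tabby:/data -e RUST_LOG=debug tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda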
I have the same issue. How can I troubleshoot it?
Same here. Running with:
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda
GPU info:
Thu Aug 15 10:11:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
| N/A 44C P0 6W / 50W | 3MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2642 G /usr/bin/gnome-shell 1MiB |
+---------------------------------------------------------------------------------------+
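One generic sanity check (not tabby-specific; a sketch assuming the stock NVIDIA CUDA base image is available) is to confirm that containers can see the GPU at all:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

If that prints the same table as on the host, the container-toolkit side is working and the problem is inside tabby itself.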
Same issue as well. Going back to tabby v0.12.0 seems to work for me (when serving CodeGemma-7B, without the webserver).
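For anyone wanting to try the same workaround, pinning the image tag is one way to do the downgrade. A sketch, assuming the 0.12.0 release is published under that tag on Docker Hub:

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data tabbyml/tabby:0.12.0 serve --model CodeGemma-7B --device cuda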
Thank you for reporting the issues. The changes in https://github.com/TabbyML/tabby/pull/2925/files will be included in the 0.16 release and will provide more detailed information in the logs to assist with debugging.