
Cannot index projects; embeddings requests return 500 from Ollama

Meister1593 opened this issue 1 year ago · 3 comments

Relevant environment info

- OS: Arch Linux
- Continue: 0.8.43
- IDE: VSCodium 1.89.1
- Model: Ollama 0.2.8, nomic-embed-text-v1.5
- config.json:
  
{
  "models": [
    {
      "title": "Llama 3",
      "provider": "ollama",
      "model": "llama3"
    },
    {
      "title": "Ollama",
      "provider": "ollama",
      "contextLength": 1024,
      "model": "AUTODETECT"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Starcoder 3b",
    "provider": "ollama",
    "model": "starcoder2:3b"
  },
  "allowAnonymousTelemetry": false,
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text"
  }
}

Description

When indexing any project, Continue keeps trying to index indefinitely and never finishes, causing heavy CPU and memory usage in the process.

To reproduce

  1. Install the Continue release
  2. Open any folder containing project files
  3. Indexing starts
  4. Indexing never finishes and stays in the indexing state indefinitely (a direct-call sketch to isolate Ollama from the extension follows below)
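
To check whether the 500 comes from Ollama itself rather than the extension, the failing request can be reproduced outside the IDE. A minimal sketch, assuming Ollama's default address (localhost:11434) and its documented /api/embeddings endpoint; the model name matches the embeddingsProvider entry in config.json above:

// reproduce-embeddings.ts — direct call to Ollama's embeddings endpoint
async function main() {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: "apiVersion: v1\nkind: ConfigMap", // any file chunk works
    }),
  });
  // A 500 status here would match the "Failed to embed chunk" plugin log below.
  console.log(res.status, await res.text());
}

main().catch(console.error);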

Log output

Plugin logs:
/home/plyshka/Documents/Code/Kubernetes/plyshserver/coturn/configmap.yaml with provider: OllamaEmbeddingsProvider::nomic-embed-text: Error: Failed to embed chunk: {"error":"llama runner process has terminated: signal: aborted (core dumped)"}

Ollama logs:
time=2024-07-29T03:16:58.535+05:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
[GIN] 2024/07/29 - 03:16:58 | 500 |  3.413399187s |       127.0.0.1 | POST     "/api/embeddings"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=13 layers.offload=0 layers.split="" memory.available="[14.1 GiB]" memory.required.full="574.9 MiB" memory.required.partial="0 B" memory.required.kv="96.0 MiB" memory.required.allocations="[574.9 MiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama4031942035/runners/cpu_avx2/ollama_llama_server --model /home/plyshka/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 37837"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-29T03:16:58.546+05:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3440 commit="d94c6e0cc" tid="139283458398016" timestamp=1722205018
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="139283458398016" timestamp=1722205018 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="37837" tid="139283458398016" timestamp=1722205018
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/plyshka/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 21
llm_load_tensors: ggml ctx size =    0.05 MiB
llm_load_tensors:        CPU buffer size =   260.86 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
[1722205018] warming up the model with an empty run
llama_new_context_with_model:        CPU compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 1
/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
time=2024-07-29T03:16:58.998+05:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server not responding"
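
Possibly relevant details visible in the logs above: the runner is launched with --ctx-size 32768, while the model metadata reports nomic-bert.context_length = 2048 (n_ctx_train), and the abort is an out-of-range std::vector access during the empty warmup run. As an unconfirmed experiment (an assumption, not a known fix), the context window can be capped per request through Ollama's documented options field to see whether the crash persists:

// cap-num-ctx.ts — same request as above, but with num_ctx limited to the
// model's training context. Whether this avoids the warmup crash is an
// assumption to verify, not a confirmed workaround.
async function main() {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "nomic-embed-text",
      prompt: "test chunk",
      options: { num_ctx: 2048 }, // matches nomic-bert.context_length above
    }),
  });
  console.log(res.status, await res.text());
}

main().catch(console.error);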

Meister1593 · Jul 28 '24 22:07