
Creating embeddings with ollama extremely slow

Open jabbor opened this issue 1 year ago • 15 comments

This is a Windows setup, also using Ollama for Windows.

System:

  • Windows 11
  • 64GB memory
  • RTX 4090 (CUDA installed)

Setup: poetry install --extras "ui vector-stores-qdrant llms-ollama embeddings-ollama"

Ollama: pull mixtral, then pull nomic-embed-text.

This is what the log shows (startup, then loading a 1 KB txt file). It is taking a long time.

Did I do something wrong?

Using python3 (3.11.8)
13:21:55.666 [INFO ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'ollama']
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100%|██████████████████████████████| 1.46k/1.46k [00:00<?, ?B/s]
13:22:03.875 [WARNING ] py.warnings - C:\Users\jwbor\AppData\Local\pypoetry\Cache\virtualenvs\private-gpt-TFCUF6yI-py3.11\Lib\site-packages\huggingface_hub\file_download.py:147: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in D:\privategpt\models\cache. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)

tokenizer.model: 100%|██████████████████████████████| 493k/493k [00:00<00:00, 39.6MB/s]
tokenizer.json: 100%|██████████████████████████████| 1.80M/1.80M [00:00<00:00, 3.74MB/s]
special_tokens_map.json: 100%|██████████████████████████████| 72.0/72.0 [00:00<00:00, 144kB/s]
13:22:05.412 [INFO ] private_gpt.components.llm.llm_component - Initializing the LLM in mode=ollama
13:22:06.695 [INFO ] private_gpt.components.embedding.embedding_component - Initializing the embedding model in mode=ollama
13:22:06.706 [INFO ] llama_index.core.indices.loading - Loading all indices.
13:22:06.706 [INFO ] private_gpt.components.ingest.ingest_component - Creating a new vector store index
Parsing nodes: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
13:22:06.827 [INFO ] private_gpt.ui.ui - Mounting the gradio UI, at path=/
13:22:06.983 [INFO ] uvicorn.error - Started server process [1572]
13:22:06.983 [INFO ] uvicorn.error - Waiting for application startup.
13:22:06.983 [INFO ] uvicorn.error - Application startup complete.
13:22:06.983 [INFO ] uvicorn.error - Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
13:22:33.469 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET / HTTP/1.1" 200
13:22:33.559 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /info HTTP/1.1" 200
13:22:33.563 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /theme.css HTTP/1.1" 200
13:22:33.768 [INFO ] uvicorn.access - 127.0.0.1:57963 - "POST /run/predict HTTP/1.1" 200
13:22:33.774 [INFO ] uvicorn.access - 127.0.0.1:57963 - "POST /queue/join HTTP/1.1" 200
13:22:33.777 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:22:42.139 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /upload HTTP/1.1" 200
13:22:42.144 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:22:42.148 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:22:42.209 [INFO ] private_gpt.server.ingest.ingest_service - Ingesting file_names=['boericke_zizia.txt']
Parsing nodes: 100%|██████████████████████████████| 1/1 [00:00<00:00, 1001.03it/s]
Generating embeddings: 100%|██████████████████████████████| 18/18 [00:37<00:00, 2.10s/it]
Generating embeddings: 0it [00:00, ?it/s]
13:23:21.988 [INFO ] private_gpt.server.ingest.ingest_service - Finished ingestion file_name=['boericke_zizia.txt']
13:23:22.054 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:23:22.057 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:23:22.167 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:23:22.171 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200

jabbor · Mar 22 '24 12:03
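
For context, the setup described above implies an Ollama profile roughly like the following. This is a hedged reconstruction, not the poster's actual file; the key names follow the default settings-ollama.yaml shipped with private-gpt at the time and should be checked against your own checkout:

    # Hedged reconstruction of the kind of settings-ollama.yaml implied by the setup above
    server:
      env_name: ${APP_ENV:ollama}

    llm:
      mode: ollama

    embedding:
      mode: ollama

    ollama:
      llm_model: mixtral                  # pulled with "ollama pull mixtral"
      embedding_model: nomic-embed-text   # pulled with "ollama pull nomic-embed-text"
      api_base: http://localhost:11434    # default Ollama endpoint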

I'm also waiting for the answer here. Ollama is very slow for me. I switched to llama and it is much faster. There is something broken with Ollama and ingestion.

JTMarsh556 · Mar 22 '24 13:03

What ingest_mode are you using? That has a significant impact on the timing; pipeline is the fastest (https://github.com/zylon-ai/private-gpt/pull/1750), see the sketch after this comment. Also, there is a problem with Ollama that was fixed in version 0.1.29 (https://github.com/zylon-ai/private-gpt/issues/1691).

dbzoo · Mar 22 '24 14:03
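
As a minimal sketch of that change (assuming ingest_mode and count_workers sit under the embedding section, as the settings shown later in this thread suggest; verify the exact keys against your private-gpt version):

    embedding:
      mode: ollama
      ingest_mode: pipeline   # the thread above uses "simple"; "pipeline" is the fast path from PR 1750
      count_workers: 8        # assumed knob for parallel embedding workers; tune to your machine

With pipeline mode, embedding work and index writing overlap, which is what the later comments in this thread discuss.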

I use simple, with Ollama version 0.1.29.

jabbor · Mar 22 '24 15:03

I updated the settings-ollama.yaml file to match what you linked and verified my Ollama version was 0.1.29, but I'm not seeing much of a speed improvement and my GPU doesn't seem to be getting tasked. Neither the available RAM nor the CPU seems to be driven much either. Three files totaling roughly 6.5 MB are taking close to 30 minutes (20 minutes with 8 workers) to ingest, whereas llama completed the ingest in less than a minute. Am I doing something wrong?

JTMarsh556 · Mar 22 '24 18:03

I've switched over to LM Studio (0.2.17) with Mixtral Instruct 8x Q4_K_M, and started the server in LM Studio.

I installed privategpt with the following installation command: poetry install --extras "ui llms-openai-like embeddings-huggingface vector-stores-qdrant"

settings-vllm.yaml:

    server:
      env_name: ${APP_ENV:vllm}

    llm:
      mode: openailike

    embedding:
      mode: huggingface
      ingest_mode: simple

    local:
      embedding_hf_model_name: nomic-embed-text

    openai:
      api_base: http://localhost:1234/v1
      api_key: EMPTY
      model: mixtral

set PGPT_PROFILES=vllm
make run

And: it now ingests much faster!

But: I hope you can amend privateGPT so that it also runs fast with Ollama!

jabbor · Mar 22 '24 20:03

The pipeline from @dbzoo is super fast :rocket:! For my tests (>100k docs) the bottleneck is still the index writing.

paul-asvb · Mar 26 '24 15:03

@paul-asvb Index writing will always be a bottleneck. With pipeline mode the index updates in the background while ingestion (the embedding work) continues. Depending on how long the index update takes, I have seen the embedding workers' output queue fill up, which stalls the workers; this is on purpose, as per the design. We could tweak the queue sizes a little, but at the end of the day writing everything to one index will always be an issue. Glad you noticed that pipeline was fast; I thought so too.

dbzoo · Mar 27 '24 00:03

It could also be the swapping from the LLM to the embedding model and back that makes it very slow; see my PR #1800.

Robinsane · Mar 27 '24 16:03

Same here. There seems to be a "hard limit" somewhere setting the pace to 2.06 to 2.11 s/it on "Generating embeddings": when embedding multiple files in "pipeline" mode, all workers are capped at that speed.

Stego72 · Apr 15 '24 08:04

pipeline does not help for me at all.

stevenlafl · May 02 '24 13:05