Align Ollama DEFAULT_CONTEXT_WINDOW to match the Ollama CLI default: 2048
Description
The Ollama integration in llama-index defaults to a context window of 3900 (surfacing as n_ctx = 3904 in the logs below), which is higher than the Ollama CLI default of 2048 (n_ctx = 2048). On memory-constrained machines this produces garbled output for even simple queries against the llama3:instruct model, which deters new users/developers from getting llama-index up and running quickly with Ollama (llama3). In contrast, langchain works out of the box because its defaults align with the Ollama CLI, so performance and consistency are retained. In other words, running a query in the interactive CLI with ollama run and running it through llama-index with default settings should behave identically.
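To make the intent concrete, here is a minimal sketch (simplified and illustrative, not the actual diff; the real constant and class live in the llama-index-llms-ollama package) of how a 2048 default would flow into the options Ollama receives, where num_ctx maps to llama.cpp's n_ctx:

```python
# Simplified, illustrative sketch only -- not the real llama-index-llms-ollama code.
# It shows how a default context window ends up in the request options that Ollama
# translates into llama.cpp's n_ctx.

DEFAULT_CONTEXT_WINDOW = 2048  # proposed default, matching the Ollama CLI


def build_ollama_options(context_window: int = DEFAULT_CONTEXT_WINDOW,
                         temperature: float = 0.75) -> dict:
    """Build the 'options' payload sent to Ollama's /api/generate."""
    return {"num_ctx": context_window, "temperature": temperature}


print(build_ollama_options())
# {'num_ctx': 2048, 'temperature': 0.75}
```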
Fixes
Fixes timeouts, junk output, and the ggml_metal_graph_compute: command buffer 3 failed with status 5
error caused by the mismatch in the default context window between the Ollama CLI and the llama-index integration.
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
- [ ] Yes
- [x] No - It's not a new integration
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
- [ ] Yes
- [x] No
Type of Change
Please delete options that are not relevant.
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
How Has This Been Tested?
Just a change in the default value.
To reproduce:
Hardware/bootstrap logs:
llama_new_context_with_model: n_ctx = 3904
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 488.00 MiB, ( 5056.88 / 5461.34)
llama_kv_cache_init: Metal KV buffer size = 488.00 MiB
llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 283.64 MiB, ( 5340.52 / 5461.34)
llama_new_context_with_model: Metal compute buffer size = 283.63 MiB
llama_new_context_with_model: CPU compute buffer size = 15.63 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
Command:
curl http://localhost:11434/api/generate -d '{
"model": "llama3:instruct",
"prompt": "Why is the sky blue?", "options": {
"num_ctx": 3904
}
}'
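For comparison, a hedged Python sketch of the same request using the CLI default of 2048 (the requests-based client is illustrative and not part of the original repro; per the description above, 2048 is what ollama run llama3:instruct uses):

```python
# Sketch: the same /api/generate request as the curl above, but with the
# Ollama CLI default context window of 2048 instead of 3904.
import json

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:instruct",
        "prompt": "Why is the sky blue?",
        "options": {"num_ctx": 2048},
    },
    stream=True,
    timeout=120,
)
# Ollama streams newline-delimited JSON chunks; print the text as it arrives.
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```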
On llama-index (with defaults):
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=120)
llm.complete("Why is the sky blue?")
Output:
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.109309Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.192524Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.261377Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.330906Z","response":";","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.399573Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.479188Z","response":"#","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.542696Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.616321Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.689998Z","response":")","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.759423Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.82956Z","response":"8","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.895623Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:40.959387Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.029689Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.095651Z","response":"5","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.168368Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.241777Z","response":"0","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.323894Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.397235Z","response":"/","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.467832Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.565109Z","response":"7","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.654553Z","response":"6","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.727295Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.793783Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.864981Z","response":"D","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:41.95098Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.024172Z","response":"'","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.103524Z","response":"2","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.187653Z","response":":","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.253375Z","response":"3","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.337022Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.399163Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.467277Z","response":"\"","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.537366Z","response":"4","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.597877Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.671598Z","response":"G","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.734082Z","response":"%","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.798676Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.864071Z","response":"\u003e","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.931211Z","response":"(","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:42.993299Z","response":"*","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.054537Z","response":".","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.119623Z","response":"\u0026","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.186948Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.264504Z","response":"C","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.330198Z","response":"F","done":false}
{"model":"llama3:instruct","created_at":"2024-04-28T03:16:43.395072Z","response":"=","done":false}
Logs from the Ollama server for the failing run above:
llama_new_context_with_model: n_ctx = 3904
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 488.00 MiB, ( 5056.88 / 5461.34)
llama_kv_cache_init: Metal KV buffer size = 488.00 MiB
llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 283.64 MiB, ( 5340.52 / 5461.34)
llama_new_context_with_model: Metal compute buffer size = 283.63 MiB
llama_new_context_with_model: CPU compute buffer size = 15.63 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: command buffer 3 failed with status 5
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":3904,"slot_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"0x1df38d000","timestamp":1714271140}
{"function":"validate_model_chat_template","level":"ERR","line":437,"msg":"The chat template comes with this model is not yet supported, falling back to chatml. This may cause the model to output suboptimal responses","tid":"0x1df38d000","timestamp":1714271140}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"7","port":"65236","tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":0,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":1,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":2,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65241,"status":200,"tid":"0x17dd13000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65237,"status":200,"tid":"0x17dae3000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":3,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65239,"status":200,"tid":"0x17dbfb000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":4,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65238,"status":200,"tid":"0x17db6f000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65240,"status":200,"tid":"0x17dc87000","timestamp":1714271140}
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":5,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
time=2024-04-28T12:25:40.713+10:00 level=DEBUG source=server.go:431 msg="llama runner started in 7.403734 seconds"
time=2024-04-28T12:25:40.718+10:00 level=DEBUG source=routes.go:259 msg="generate handler" prompt="Why is the sky blue?"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:260 msg="generate handler" template="{{ if .System }}<|start_header_id|>system<|end_header_id|>\n\n{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>\n\n{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>\n\n{{ .Response }}<|eot_id|>"
time=2024-04-28T12:25:40.719+10:00 level=DEBUG source=routes.go:261 msg="generate handler" system=""
time=2024-04-28T12:25:40.723+10:00 level=DEBUG source=routes.go:292 msg="generate handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nWhy is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
{"function":"process_single_task","level":"INFO","line":1510,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":6,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"log_server_request","level":"INFO","line":2741,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":65273,"status":200,"tid":"0x17dd9f000","timestamp":1714271140}
{"function":"launch_slot_with_data","level":"INFO","line":833,"msg":"slot is processing task","slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1816,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":16,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
{"function":"update_slots","level":"INFO","line":1840,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":7,"tid":"0x1df38d000","timestamp":1714271140}
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
ggml_metal_graph_compute: command buffer 3 failed with status 5
- [ ] Added new unit/integration tests
- [ ] Added new notebook (that tests end-to-end)
- [ ] I stared at the code and made sure it makes sense
Suggested Checklist:
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] I have added Google Colab support for the newly added notebooks.
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I ran make format; make lint to appease the lint gods
2048 is extremely small for most RAG use cases.
Personally I've never had issues using Ollama 😅 and I'm super confused why this specific setting would cause issues. Llama3, for example, has something like an 8K context window?
@logan-markewich I'd assume we'd want to set the defaults to a value that is a common denominator, something that works for everyone out of the box without tweaking. In its current state the library fails silently with a timeout, leaving no trace of the underlying issue, on a machine like an Apple M1 with 8GB of memory, which I think is a reasonable baseline to target. Furthermore, the CLI ollama run llama3
starts with a context window of 2048, so the same query succeeds in the CLI but fails through llama-index.
For context, this refers to llama3:instruct (a quantized model), ~4.7GB.
@logan-markewich Not to mention langchain with llama3 works out of the box with its default settings, but llama-index with llama3 doesn't (on any machine equivalent to an 8GB Apple M1).
I agree with Logan here that 2K is too small for many RAG applications. In fact, we should be going higher: 8K for Llama 3 and 64K for Mixtral 8x22B.
That said, I hear @komal-SkyNET about the difficulties when running on machines with 8GB of system RAM, so let's reach out to Ollama to see if they can give us back some kind of error in that scenario. If not, we can do a quick and dirty hack using psutil. Actually, doing some kind of psutil check might not be a bad idea regardless, to prevent us from locking up users' computers like the first time I tried our Ollama integration (and that was with 16GB of RAM!).
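For what it's worth, a rough and entirely hypothetical sketch of the kind of psutil check being floated here (the helper name, the 8 GiB threshold, and the 2048 fallback are illustrative only):

```python
# Hypothetical sketch of a psutil-based guard, not part of any actual change:
# fall back to a small context window when the machine has little free memory.
import psutil


def suggested_context_window(requested: int = 8192) -> int:
    """Clamp the requested context window on low-memory machines.

    The 8 GiB threshold and the 2048 fallback are illustrative values only.
    """
    available_gib = psutil.virtual_memory().available / (1024 ** 3)
    if available_gib < 8:
        return min(requested, 2048)
    return requested


print(suggested_context_window())
```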
Hi folks, I work on Ollama. Sorry you hit this issue! A fix is on the way and will be in the next release: https://github.com/ollama/ollama/pull/4068
In light of the above, going to close this.