
[Bug]: thread 'tokio-runtime-worker' panicked

Open trofdev opened this issue 7 months ago • 4 comments

What happened?

chromadb latest/1.0.6

Chroma panics at random after running for a while. This didn't happen on earlier versions, but our data volume has also increased since then.

Versions

chroma 1.0.6 (latest), python:3.11, running as Docker containers. The host OS is Ubuntu 24.04.2 LTS, which shouldn't be relevant.

Relevant log output

thread 'tokio-runtime-worker' panicked at /chroma/rust/system/src/wrapped_message.rs:88:30:
message reply channel was unexpectedly dropped by caller: Ok(())
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/result.rs:1679:5
   3: core::result::Result<T,E>::expect
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/result.rs:1059:23
   4: <core::option::Option<chroma_system::wrapped_message::HandleableMessageImpl<M,<C as chroma_system::types::Handler<M>>::Result>> as chroma_system::wrapped_message::HandleableMessage<C>>::handle_and_reply::{{closure}}
             at ./chroma/rust/system/src/wrapped_message.rs:86:25
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/future/future.rs:123:9
   6: chroma_system::wrapped_message::WrappedMessage<C>::handle::{{closure}}
             at ./chroma/rust/system/src/wrapped_message.rs:55:61
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
   8: chroma_system::executor::ComponentExecutor<C>::run::{{closure}}
             at ./chroma/rust/system/src/executor.rs:98:64
   9: chroma_system::system::System::start_component::{{closure}}
             at ./chroma/rust/system/src/system.rs:54:65
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  11: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/core.rs:331:17
  12: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/loom/std/unsafe_cell.rs:16:9
  13: tokio::runtime::task::core::Core<T,S>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/core.rs:320:30
  14: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:499:19
  15: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
  16: std::panicking::try::do_call
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
  17: std::panicking::try
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
  18: std::panic::catch_unwind
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
  19: tokio::runtime::task::harness::poll_future
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:487:18
  20: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:209:27
  21: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:154:15
  22: tokio::runtime::task::raw::RawTask::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/raw.rs:201:18
  23: tokio::runtime::task::LocalNotified<S>::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/mod.rs:435:9
  24: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:596:18
  25: tokio::runtime::coop::with_budget
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/coop.rs:107:5
  26: tokio::runtime::coop::budget
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/coop.rs:73:5
  27: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:595:9
  28: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:546:24
  29: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:511:21
  30: tokio::runtime::context::scoped::Scoped<T>::set
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/context/scoped.rs:40:9
  31: tokio::runtime::context::set_scheduler::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/context.rs:180:26
  32: std::thread::local::LocalKey<T>::try_with
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/local.rs:283:12
  33: std::thread::local::LocalKey<T>::with
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/thread/local.rs:260:9
  34: tokio::runtime::context::set_scheduler
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/context.rs:180:17
  35: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:506:9
  36: tokio::runtime::context::runtime::enter_runtime
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/context/runtime.rs:65:16
  37: tokio::runtime::scheduler::multi_thread::worker::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:498:5
  38: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/scheduler/multi_thread/worker.rs:464:45
  39: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/blocking/task.rs:42:21
  40: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/core.rs:331:17
  41: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/loom/std/unsafe_cell.rs:16:9
  42: tokio::runtime::task::core::Core<T,S>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/core.rs:320:30
  43: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:499:19
  44: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:272:9
  45: std::panicking::try::do_call
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:557:40
  46: std::panicking::try
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:521:19
  47: std::panic::catch_unwind
             at ./rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panic.rs:350:14
  48: tokio::runtime::task::harness::poll_future
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:487:18
  49: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:209:27
  50: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/harness.rs:154:15
  51: tokio::runtime::task::raw::RawTask::poll
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/raw.rs:201:18
  52: tokio::runtime::task::UnownedTask<S>::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/task/mod.rs:472:9
  53: tokio::runtime::blocking::pool::Task::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/blocking/pool.rs:161:9
  54: tokio::runtime::blocking::pool::Inner::run
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/blocking/pool.rs:511:17
  55: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at ./usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.41.1/src/runtime/blocking/pool.rs:469:13

trofdev avatar Apr 23 '25 22:04 trofdev

@dertruffel, can you share during what type of operations your Chroma instance has panicked?

tazarov avatar Apr 24 '25 10:04 tazarov

About 90% of operations are retrievals from the database, although I don't log that kind of data. Is there a way to auto-restart upon panic? @tazarov Additionally, I upgraded to 1.0.7.dev17 and it still happens, with the same trace.

trofdev avatar Apr 24 '25 11:04 trofdev

@dertruffel Can you share some of your code? What kind of client are you using? More information will be very helpful in order to debug this

itaismith avatar Apr 25 '25 00:04 itaismith

@itaismith

import os
from chromadb import HttpClient
# CHROMA_HOST is not defined in the snippet as posted.
CHROMA_PORT = os.environ.get("CHROMA_PORT", "8000")
CHROMA_URL = f"http://{CHROMA_HOST}:{CHROMA_PORT}"
CHROMA_CLIENT = HttpClient(host=CHROMA_URL)

def list_files_in_collection(collection_name):
    try:
        collection = CHROMA_CLIENT.get_or_create_collection(name=collection_name)
        results = collection.get(include=["metadatas"])
        metadatas = results["metadatas"]
        file_paths = [meta.get("file_path") for meta in metadatas]
        return list(set(file_paths))
    except Exception as ex:
        print("CHROMA LIST ISSUE", ex)
        return []

def get_gemini_embedding(query):
    try:
        gemini.configure(api_key=GEMINI_KEY)
        embed_model = 'models/text-embedding-004'
        embedding_resp = gemini.embed_content(model=embed_model, content=query)
        return embedding_resp["embedding"]
    except Exception as ex:
        print("CHROMA GEMINI ISSUE", ex)
        return None

def query_chroma(query, collection_name, top_k=3):
    try:
        query_embedding = get_gemini_embedding(query)
        if query_embedding is None:
            # Embedding failed (already logged above); nothing to query with.
            return []

        collection = CHROMA_CLIENT.get_or_create_collection(name=collection_name)
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"],
        )
        # Pair each returned document with its metadata for the single query.
        return list(zip(results["documents"][0], results["metadatas"][0]))
    except Exception as ex:
        print("CHROMA QUERY ISSUE", ex)
        return []

It panics randomly on both writes and reads, and also when Docker is pulling a new build, so every time I start a new Django container I need to reboot Chroma. I'm running chromadb/chroma on Docker; versions 1.0.5, 1.0.6, and the 1.0.7.dev builds all panic.

trofdev avatar Apr 25 '25 16:04 trofdev

I got exactly the same error. If there is anything I can do to help, please let me know.

AccsoAndreBuesgen avatar May 20 '25 15:05 AccsoAndreBuesgen

@AccsoAndreBuesgen Until the Chroma devs fix the issue, the only band-aid I've found is to monitor the logs and restart the container whenever it panics. Not ideal.

trofdev avatar May 20 '25 21:05 trofdev
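
The log-monitoring workaround described in the comment above could be sketched roughly as follows. This is a minimal sketch, not the script trofdev actually uses: the container name `chroma` is a placeholder, and the panic marker is copied from the log lines quoted earlier in this thread. It assumes the `docker` CLI is on the PATH.

```python
import subprocess
import time

# Signature taken from the panic line in the logs quoted in this thread.
PANIC_MARKER = "thread 'tokio-runtime-worker' panicked"
# Hypothetical container name; replace with your own.
CONTAINER = "chroma"


def is_panic_line(line: str) -> bool:
    """Return True if a log line looks like the tokio worker panic."""
    return PANIC_MARKER in line


def watch_and_restart(container: str = CONTAINER) -> None:
    """Follow the container logs and restart the container on each panic."""
    while True:
        # --tail 0 skips old log lines so a past panic does not retrigger.
        proc = subprocess.Popen(
            ["docker", "logs", "--follow", "--tail", "0", container],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            if is_panic_line(line):
                proc.terminate()
                subprocess.run(["docker", "restart", container], check=False)
                break
        proc.wait()
        time.sleep(2)  # brief pause before re-attaching to the logs


if __name__ == "__main__":
    watch_and_restart()
```

This only papers over the problem: requests that are in flight when the panic occurs still fail, and the watcher itself has to be kept running somewhere.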

Just a bump here. I just upgraded from 0.6.2 to 1.0.12 and I've already had this 3 times today while testing in my dev environment after the upgrade. I went searching to see if there is a resolution and found this thread. This is Chroma running in a Docker container on a Mac.

Connect to Chroma at: http://localhost:8000
Getting started guide: https://docs.trychroma.com/docs/overview/getting-started
OpenTelemetry is not enabled because it is missing from the config
Listening on 0.0.0.0:8000
thread 'tokio-runtime-worker' panicked at /chroma/rust/system/src/wrapped_message.rs:88:30:
message reply channel was unexpectedly dropped by caller: Ok(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Having to restart the container to resolve.

katewilkins@kate-wilkins-L37L4QFWLN backend % go run cmd/utils/collection_details/main.go
Fetching collection details...
Collection  ID                                    Embedding Size  EF Construction  EF Search  HNSW Space
---------   --                                    --------------  --------------   ---------  ----------
address     83b90a1e-10fb-4f7b-a5be-8f7fd68ff354  768             500              300        ip
email       9b812b3d-5888-40f9-997b-336a7b9b93c5  768             500              300        ip
location    dc61e956-b0e9-418a-ae36-1eab6d20664c  3               500              300        ip
name        996f4314-3a00-47bc-b9f0-13c0b4c23a54  768             500              300        ip
katewilkins@kate-wilkins-L37L4QFWLN backend % go run cmd/utils/collection_sizes/main.go  
Fetching collection sizes...
Collection  Size
---------   ----
address     1813
email       4602
location    5079
name        3684

Total collections: 4
Total items: 15178

It's a little worrying. I was forced to migrate from 0.6.2 after 3 index corruptions in 3 weeks in production, where the collections are on the order of 6-18M rows each, and I'm a bit worried I'm swapping one issue for another.

katew-deriv avatar Jun 09 '25 13:06 katew-deriv

Second this.

thesethtruth avatar Jun 11 '25 10:06 thesethtruth

BUMP

I can now reproduce this issue. I'm accessing Chroma via the v2 APIs and hammering them with adds to a collection to do a rebuild. If I kill my rebuild process midway through a call to Chroma, it triggers this issue, and the only solution is a complete restart of the Chroma Docker container. It happens every time I kill the process while it's in a call.

IMHO this is a very serious bug in the handling of threads, see this link:

https://users.rust-lang.org/t/tokio-runtime-panics-at-shutdown/93651/4

katew-deriv avatar Jun 11 '25 13:06 katew-deriv
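
For anyone who wants to try the reproduction described in the comment above, it might look something like the sketch below. This is an assumption-laden approximation, not katew-deriv's actual rebuild process: the host, port, collection name `repro`, batch size, and 5-second delay are all made up for illustration, and it uses the Python `chromadb` HTTP client rather than raw v2 API calls. It may or may not trigger the panic on a given setup.

```python
import multiprocessing as mp
import random
import time


def make_batch(start: int, size: int, dim: int):
    """Build ids and random embeddings for one add() call."""
    ids = [f"id-{i}" for i in range(start, start + size)]
    embeddings = [[random.random() for _ in range(dim)] for _ in range(size)]
    return ids, embeddings


def hammer_adds(host: str, port: int, name: str) -> None:
    """Add batches in a tight loop until the process is killed."""
    import chromadb  # imported here so the module loads without chromadb

    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_or_create_collection(name=name)
    start = 0
    while True:
        ids, embeddings = make_batch(start, size=500, dim=768)
        collection.add(ids=ids, embeddings=embeddings)
        start += len(ids)


if __name__ == "__main__":
    # Hypothetical connection details and collection name.
    worker = mp.Process(target=hammer_adds, args=("localhost", 8000, "repro"))
    worker.start()
    time.sleep(5)  # let some requests get in flight
    worker.kill()  # hard-kill the client, likely mid-request
    worker.join()
```

After running it, check the Chroma container logs for the `wrapped_message.rs` panic line quoted earlier in the thread.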

Thanks for reporting, we're putting priority behind this on our end. Will post here when we have an update.

HammadB avatar Jun 11 '25 16:06 HammadB