Documents lost after ingestion

Open KlimTodrik opened this issue 1 year ago • 5 comments

Describe the bug

After ingesting 1.8T of data, we lost 9 documents. We log every response from Quickwit during ingestion, and the logs show:

{"num_docs_for_processing":7000} #247545 times
{"num_docs_for_processing":1000} #2 times
{"num_docs_for_processing":71} #1 time

So we expected 1732817071 docs in total, but got 1732817062:

# curl -H "Content-type: application/json" -X POST \
> http://localhost:7280/api/v1/taxi/search/ \
> -d '{"query":"*","max_hits":0,"aggs":{"count(*)":{"value_count":{"field":"id"}}}}'
{ 
  "num_hits": 1732817062,
  "hits": [],
  "elapsed_time_micros": 1679406,
  "errors": [],
  "aggregations": {
    "count(*)": {
      "value": 1732817062.0
    }
  }
}
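
The expected total can be re-derived from the logged response counts. This is a minimal sketch that only reuses the figures quoted in this report; the variable names are illustrative.

```python
# Batch size -> number of times that response was logged (figures from the logs above).
logged_responses = {7000: 247545, 1000: 2, 71: 1}

# Total number of documents Quickwit acknowledged for processing.
expected_docs = sum(size * times for size, times in logged_responses.items())

# num_hits returned by the search request above.
indexed_docs = 1732817062

missing_docs = expected_docs - indexed_docs
print(f"expected={expected_docs} indexed={indexed_docs} missing={missing_docs}")
```

The difference between the acknowledged and the indexed counts is the 9 documents mentioned in the description.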

Steps to reproduce (if applicable)

This is a large amount of data, so we can't easily provide a dump.

You can reproduce this issue with the db-benchmarks database comparison tool:

  1. Clone the comparison tool:
git clone [email protected]:db-benchmarks/db-benchmarks.git
cd db-benchmarks
git checkout feat/quickwit
  2. Copy .env.example to .env
  3. Update cpuset in .env to match the CPUs your machine has
  4. Open the test folder:
cd tests/taxi
  5. Add exit 1 to prevent the other engines from initializing (this doesn't affect the issue and saves space)
  6. Run ./init

Ingestion takes 3-4 days, after which you will see the problem.

Expected behavior

A document count of 1732817071 in the results.

Configuration: Please provide:

  1. Output of quickwit --version: 0.8.1
  2. The index_config.yaml

KlimTodrik, Sep 30 '24 15:09

If documents don't match the schema, they won't be indexed, which may cause the mismatch.

PSeitz, Oct 01 '24 01:10
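
One way to narrow down which documents might be rejected, without searching the server logs, is to check them on the client side before ingestion. The following is only a rough sketch: `EXPECTED_FIELDS` and the `pickup_datetime` field are hypothetical (the real index_config.yaml is not shown in this thread), and Quickwit's own validation rules may differ from this simple type check.

```python
import json

# Hypothetical field -> type mapping; replace with the fields of the real doc mapping.
EXPECTED_FIELDS = {"id": int, "pickup_datetime": str}


def find_suspect_lines(ndjson_lines):
    """Return (line_number, reason) for each line that is not a JSON object
    or whose fields are missing or do not have the expected type."""
    suspects = []
    for line_number, line in enumerate(ndjson_lines, start=1):
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as err:
            suspects.append((line_number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(doc, dict):
            suspects.append((line_number, "not a JSON object"))
            continue
        for field, expected_type in EXPECTED_FIELDS.items():
            value = doc.get(field)
            # bool is a subclass of int in Python, so reject it explicitly for int fields.
            is_bool_as_int = expected_type is int and isinstance(value, bool)
            if not isinstance(value, expected_type) or is_bool_as_int:
                suspects.append((line_number, f"field {field!r} missing or wrong type"))
                break
    return suspects


sample = [
    '{"id": 1, "pickup_datetime": "2015-01-01 00:00:00"}',
    '{"id": "oops", "pickup_datetime": "2015-01-01 00:00:00"}',
    "not json at all",
]
suspect_lines = find_suspect_lines(sample)
for line_number, reason in suspect_lines:
    print(line_number, reason)
```

Running such a check over each batch before sending it would at least point to candidate lines, at the cost of duplicating part of the schema on the client.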

> If documents don't match the schema, they won't be indexed, which may cause the mismatch

Should it respond with an error in that case? We don't see any error responses.

KlimTodrik, Oct 01 '24 21:10

No, I think it currently only logs errors.

PSeitz, Oct 02 '24 00:10

> No, I think it only logs errors currently

There are 1,732,817,071 documents, so analyzing all the logs to find the error is quite difficult. I think it would be much better to notify the user directly when something goes wrong, either via the response (not just a 200 status) or by providing a dedicated errors endpoint.

KlimTodrik, Jan 10 '25 15:01

The link to your index config doesn't work, but maybe the retention policy kicked in? If not specified there may be a default one, not sure.

mrcnski, Mar 14 '25 18:03