Search requests with high `max_hits` (> ~100) cause fetch errors with indexes stored on Cloudflare R2.
Describe the bug
Search requests with high max_hits (> ~100) cause fetch errors with indexes stored on Cloudflare R2.
This is likely because Quickwit goes over the Cloudflare R2 limit of 1000 GetObject requests per second: https://developers.cloudflare.com/r2/platform/limits/#account-plan-limits
Presumably each returned document is fetched with its own ranged GetObject request, so a single search with max_hits=1000 can approach that limit on its own.
As discussed on Discord, it would be helpful if Quickwit propagated the Cloudflare R2 error message to confirm this theory.
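For what it's worth, here is a minimal sketch of the kind of client-side throttling that would keep a search under the limit, assuming tokio. This is not Quickwit's actual code: `fetch_slice`, the permit count of 100, and the object path are all made up for illustration.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in for the ranged GetObject call issued per document fetch;
// made up for illustration, not Quickwit's real storage API.
async fn fetch_slice(object: &str, range: std::ops::Range<u64>) -> Vec<u8> {
    let _ = (object, range);
    Vec::new()
}

#[tokio::main]
async fn main() {
    // Assumption: 100 in-flight requests stays well under R2's 1000 req/s cap.
    let permits = Arc::new(Semaphore::new(100));
    let mut tasks = Vec::new();
    for i in 0..1_000u64 {
        let permits = Arc::clone(&permits);
        tasks.push(tokio::spawn(async move {
            // Each doc fetch waits for a permit before issuing its request.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            fetch_slice("indexes/reddit_comments_v05/example.split", i * 1024..(i + 1) * 1024).await
        }));
    }
    for task in tasks {
        task.await.expect("fetch task panicked");
    }
}
```

Whether throttling internally or simply surfacing the R2 error is the right fix is up for discussion; the sketch only illustrates the semaphore approach.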
Steps to reproduce the behavior:
- Search request:
❯ curl http://127.0.0.1:7280/api/v1/reddit_comments_v05/search\?query\=frankfurt\&max_hits\=1000\&sort_by_field\=-created_utc
{
"InternalError": "Internal error: `Error when fetching docs for splits [\"01G7YXYNFFVCQBY8Y2B5078EWW\", \"01G7YXPQQM8DER0F8112WK5C2P\", \"01G7YY6GH3N259P137D7TF6PA7\", \"01G7YWQHPA2Q1PHK8B0BXS0ZJP\"]: searcher-doc-async\n\nCaused by:\n An IO error occurred: 'Failed to fetch slice 1448721986..1449072261 for object: s3://test-bucket/indexes/reddit_comments_v05/01G7YXPQQM8DER0F8112WK5C2P.split'.`."
}
Configuration: Please provide:
- Output of `quickwit --version`:
❯ RUST_LOG=quickwit=debug bin/quickwit --version
Quickwit 0.3.1-nightly
I just compiled the latest main branch, including this PR: https://github.com/quickwit-oss/quickwit/pull/1717
It doesn't solve the issue in this case, but the error message changes to the following:
2022-07-16T01:44:26.987Z ERROR fetch_docs: quickwit_search::fetch_docs: Error when fetching docs in splits. split_ids=["01G7YST0ZMXB9VKBQCR3B6JAEN", "01G7YY6GH3N259P137D7TF6PA7", "01G7YXYNFFVCQBY8Y2B5078EWW"] error=searcher-doc-async
Caused by:
An IO error occurred: 'Error obtaining chunk: error reading a body from connection: unexpected end of file Ctx:Failed to fetch slice 1420380118..1420730462 for object: s3://test-bucket/indexes/reddit_comments_v05/01G7YST0ZMXB9VKBQCR3B6JAEN.split'
I tried to reproduce this on Cloudflare with the hdfs dataset, without success:
- /search?query=severity_text:ERROR&max_hits=3000 with 260 num_hits runs fine
- /search?query=severity_text:INFO&max_hits=3000 with 2716636 num_hits runs fine
leaf_search index="hdfs_large" splits=[
SplitIdAndFooterOffsets { split_id: "01G92CG810QEQ8FN7G18ZMQF4S", split_footer_start: 28996247, split_footer_end: 29004531 },
SplitIdAndFooterOffsets { split_id: "01G92CJ2NKBBNVQAF48AWY3ZED", split_footer_start: 28097070, split_footer_end: 28105322 },
SplitIdAndFooterOffsets { split_id: "01G92CKXC20W7625X3493EAYC3", split_footer_start: 28951191, split_footer_end: 28959463 },
SplitIdAndFooterOffsets { split_id: "01G92CNR3BZ62FYJJX0N3WVNHC", split_footer_start: 30100613, split_footer_end: 30109064 },
SplitIdAndFooterOffsets { split_id: "01G92D7AG96SBF0SRJTX6HB89W", split_footer_start: 30082968, split_footer_end: 30091340 },
SplitIdAndFooterOffsets { split_id: "01G92D95820XT2JKRFNYHP66CQ", split_footer_start: 31186028, split_footer_end: 31194332 },
SplitIdAndFooterOffsets { split_id: "01G92DAZXR9XPQRJYV1MVFADFP", split_footer_start: 29952318, split_footer_end: 29960094 }
]
@laurids-reichardt it seems you are using an open dataset. Can you please drop the link? Your index config could also help in reproducing with the exact field options.
@evanxg852000 Yes, here are the steps to reproduce a similar setup:
# download the dataset
curl -O https://files.pushshift.io/reddit/comments/RC_2022-04.zst
# decompress and ingest dataset via CLI
zstd -d --stdout RC_2022-04.zst --long=31 | ./quickwit index ingest --index reddit_comments_v05
Index config:
version: 0
index_id: reddit_comments_v05
index_uri: "s3://test-bucket/indexes/reddit_comments_v05"
doc_mapping:
  field_mappings:
    - name: created_utc
      type: i64
      fast: true
    - name: body
      type: text
      tokenizer: en_stem
      record: position
indexing_settings:
  timestamp_field: created_utc
  commit_timeout_secs: 1200
  resources:
    num_threads: 4
search_settings:
  default_search_fields: [body]
My original setup also included the comment_id (the `name` field of the above-mentioned dataset, converted from base36 to i64) as another fast field:
- name: comment_id
  type: i64
  fast: true
It's a bit more involved to get this value into the index, as it depends on clickhouse-local for preprocessing. If you believe this to be relevant, I'm happy to provide more instructions on how to do it.
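For illustration only (my actual preprocessing runs through clickhouse-local), the base36 → i64 conversion itself is small in Rust; the helper name and the "t1_" prefix handling below are just a sketch:

```rust
// Sketch of the base36 -> i64 conversion used for the comment_id fast field.
fn comment_id_to_i64(name: &str) -> Option<i64> {
    // Reddit "name" values carry a type prefix like "t1_"; strip it if present.
    let id = name.strip_prefix("t1_").unwrap_or(name);
    // Parse the remaining base36 digits into a sortable i64.
    i64::from_str_radix(id, 36).ok()
}

fn main() {
    assert_eq!(comment_id_to_i64("t1_a"), Some(10));
    assert_eq!(comment_id_to_i64("zz"), Some(35 * 36 + 35));
}
```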