sql icon indicating copy to clipboard operation
sql copied to clipboard

Extend size_limit setting in query engine to support unlimited index query.

Open penghuo opened this issue 3 years ago • 7 comments

Problem statements

Currently, the query.size_limit setting configure the maximum amount of documents to be pull from OpenSearch. The default value is: 200. for example, Let's say size_limit = 200, and index has 10K docs.

  • expected, return 200 rows. source=index
  • expected return 1 docs source=index | head 1
  • return 200 rows, it is not user expected. source=index | head 11000

Proposal

The query.size_limit configure the maximum amount of rows returned by query. The default value is: 200. size_limit must larger than 0. If the query has head(PPL) or limit(SQL). it will override the query.size_limit setting.

Expectation of search query .

  • should return 200 rows. source=index
  • should return 1 rows. source=index | head 1
  • should return 11000 rows. source=index | head 11000

Expectation of aggregation query.

  • should return 200 rows. source=index | stats request, count(*) by request
  • should return 11000 rows.
    source=index | stats request, count(*) by request | head 11000

penghuo avatar Jul 20 '22 23:07 penghuo

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

dai-chen avatar Jul 21 '22 00:07 dai-chen

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

Update the proposal as discussed.

penghuo avatar Jul 21 '22 20:07 penghuo

Design

OpenSearch Request

Request Operators

Non Aggregation Query

Interface to the OpenSearch engine used by the OpenSearchIndexScan physical plan

  1. OpenSearchQueryRequest The default request operator.
  2. OpenSearchScrollRequest This is used if query size exceeds the index.max_result_window setting. It invokes scroll requests to OpenSearch and fetches results in batches.
Aggregation Query

There's no scroll request for aggregation queries in OpenSearch. For a composite (group by) aggregation query, the response contains a keyAfter field, which can be used in the next request to fetch the next buckets.

Request Builder

OpenSearchRequestBuilder builds OpenSearchQueryRequest or OpenSearchScrollRequest, depending on whether scrolling is needed.

Physical Plan Implementation

  1. Get index.max_result_window for indices.
  2. Initializes OpenSearchIndexScan, which contains OpenSearchRequestBuilder
  3. Visit logical plan with index scan as context, so logical operators visited will accumulate (push down) OpenSearch query and aggregation DSL on index scan. The operations are pushed down to the request builder.

Index Scan Execution

  1. Build the request upon plan.open()
  2. Fetch the results in batches of size maxResultWindow
  3. When plan.close(), clean up the cursor and context in OpenSearch engine if request type is OpenSearchScrollRequest

seankao-az avatar Aug 01 '22 16:08 seankao-az

Remaining issues:

  1. Extend query size limit for aggregation query requests
  2. Described as follows

Here we assume

query.size_limit = 200
index.max_result_window = 10000

These work as expected:

  • source=index returns 200 rows
  • source=index | head 1 returns 1 row
  • source=index | head 300 returns 300 rows
  • source=index | head 11000 returns 11000 rows using scroll
  • source=index | fields a,b returns 200 rows
  • source=index | fields a,b | head 1 returns 1 row

But these don't:

  • source=index | fields a,b | head 300 returns 200 rows
  • source=index | fields a,b | head 11000 returns 200 rows

The reason being that limit is only pushed down to index scan if they're optimized and merged into a single node. In these two cases the index scan has query size 200 (query.size_limit).

Solution

Option 1

Better logical plan optimization so that the Project logical plan node doesn't block optimization for other plan nodes. Project isn't merged with Relation / Index Scan, and thus stops Limit from merging with Relation / Index Scan

seankao-az avatar Aug 11 '22 00:08 seankao-az

One note on the performance. With this feature, there's no limitation on the size of the query result anymore, so it's possible that a single request-response cycle take too long and timeout.

seankao-az avatar Aug 11 '22 18:08 seankao-az