OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

[BUG] Significant performance degradation with Scroll API in OpenSearch versions >= 2.6

Open wimroy opened this issue 1 year ago • 6 comments

Describe the bug

We noticed a significant performance hit while using the Scroll API to scroll a large volume of data using version 2.11.0. A process that previously took a couple of hours with version 2.3.0 is now taking several more hours after upgrading to 2.11.0. We ran some tests which indicate this performance change to have been introduced in 2.6.0. This is also easily reproducible across various environments including a cluster with multiple master and data nodes, as well as a local system running a single node instance. As seen in the attached graphs, that prior to 2.6.0 the scroll API has a relatively flat response time and starting with 2.6.0 the response time increases (linearly?) to some level and stays at that level thereafter. Overall the response time for 2.6+ is worse off than prior versions.

Related component

Search:Performance

To Reproduce

Test Machine Details

  • MacBook Pro M3 with 36GB RAM and 500GB of storage running MacOS 14.6.1
  • JDK-17.0.12

Setup

  • Install GNU Coreutils required for the gdate and ghead commands (Required only on MacOS)

    $ brew install coreutils

  • Clone the opensearch repository and checkout the tag for the version to test.

    $ git checkout -b 2.5.0 tags/2.5.0

  • Build the distribution

    $ ./gradlew assemble -x test

  • Extract the relevant distribution for your platform. In my case, I used the no-jdk-darwin-arm64-tar bundle.

    cluster.name: local-cluster
    node.name: node1
    path.data: /tmp/data
    path.logs: /tmp/logs
    path.repo: /tmp/backup
    network.host: 0.0.0.0
    http.port: 10200
    transport.port: 10300
    cluster.initial_cluster_manager_nodes: node1
    bootstrap.system_call_filter: false
    

Test

  • Index about 40 million records into OpenSearch (using opensearch_data_load.py, requirements.txt).

    $ python3 -m venv .testenv
    $ source ./.testenv/bin/activate
    $ pip3 install -r requirements.txt
    $ python opensearch_data_load.py
    
  • Run the test script (perftest.sh) that scrolls through the dataset. We used a batch size of 10,000 to scroll which was most optimal for our use case. If running on Linux, replace instances of gdate with date in the script.

    $ ./perftest.sh 10000 perf_test > testrun.log

  • Extract the response times from the log file into a charting tool like MSExcel to see the trend. Use the head command instead of ghead if running on Linux.

    $ tail -n +2 testrun1.log | ghead -n -3 | awk '{print $(NF-1)}'

  • Repeat the above Setup and Test for 2.6.0 and other OpenSearch versions.

Expected behavior

Our expectation is that the Scroll API performance should remain consistent across different versions, but especially not get any worse.

Additional Details

Plugins No plugins installed

Screenshots ScrollResponse-OpenSearch-2 3 0 ScrollResponse-OpenSearch-2 4 0 ScrollResponse-OpenSearch-2 5 0 ScrollResponse-OpenSearch-2 6 0 ScrollResponse-OpenSearch-2 7 0 ScrollResponse-OpenSearch-2 11 0 ScrollResponse-OpenSearch-2 16 0

Host/Environment (please complete the following information):

  • MacBook Pro M3 with 36GB RAM and 500GB of storage running MacOS 14.6.1
  • JDK-17.0.12

Additional context Test Scripts to load and test are attached scripts.zip

wimroy avatar Oct 10 '24 00:10 wimroy

Thanks for reporting this. Since you have a clear repro, try bisecting this down a change and running tests against a local environment launched via ./gradlew run?

dblock avatar Oct 10 '24 15:10 dblock

@dblock - Change that introduced the performance regression is in this commit 4aeae87c25b in 2.6.0.

See the attached charts below for 4aeae87c25b and the previous commit cb42bb2a1f7.

ScrollResponse-OpenSearch-2 6 0-cb42bb2a1f7ff941cf7e239d72e73f681102a0ce ScrollResponse-OpenSearch-2 6 0-4aeae87c25b6668e327eb98956cd9727fbb9d38f

Looking at the change in 4aeae87c25b, though, I'm not so sure it's a bug, but more of a documentation issue. We were able to work around this issue by adding a sort criteria to sort by _doc in Ascending order to the initial Search request created before we start Scrolling. I didn't find any reference in the OpenSearch docs on using _doc to scroll (apologies if it's documented and I missed it). I did, however, find a note in the Elasticsearch docs.

_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.

Modifying my initial scroll query to this gave improved response times. See the additional graphs tested with 2.11.0 for scrolling with and without the sort criteria.

'{
  "size": '"$batch_size"',
  "_source": false,
  "version": true,
  "explain": false,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_doc": {
        "order": "asc"
      }
    }
  ]
}'

ScrollResponse-OpenSearch-2 11 0-no-sort_doc ScrollResponse-OpenSearch-2 11 0-sort_doc

wimroy avatar Oct 14 '24 04:10 wimroy

[Search Triage] @gashutos Can you please verify if this is related to the optimization you did?

sandeshkr419 avatar Oct 16 '24 16:10 sandeshkr419

@rishabh6788 Do we have any operation in our nightly benchmarks that can help verify this going forward? cc: @rishabhmaurya

getsaurabh02 avatar Oct 16 '24 16:10 getsaurabh02

We do have scroll operation running for big5 workload in our nightlies but the oldest version data that we have is for OS-2.7 version. Before that we have it for 1.x line.

Screenshot 2024-10-16 at 10 55 31 AM

Graph Link: https://s12d.com/nWqMHevE This is on single-node cluster with 1-shard-0-replica configuration. The corpus size is 100G, index size if ~24G (116M records).

rishabh6788 avatar Oct 16 '24 18:10 rishabh6788

We were able to work around this issue by adding a sort criteria to sort by _doc in Ascending order to the initial Search request created before we start Scrolling.

Maybe we should just do that by default. If someone does a scroll without specifying their own sort criteria, we can/should sort by _doc ascending.

msfroh avatar Oct 16 '24 22:10 msfroh