
[META] Scale-Up Improvements on Single Load Generation Host

Open IanHoang opened this issue 10 months ago • 3 comments

Overview

Understanding the scalability of OpenSearch Benchmark's search clients is crucial for OSB's future development, as it will inform its usage patterns and drive future enhancements.

More details are laid out in this RFC. This meta issue pertains only to the first component of the tasks indicated in the RFC, investigation into the scaling performance of OSB on a single load generation host.

The tasks can be broken up into the following milestones:

Milestones

Efforts: S = estimated 1 week, M = estimated 4 weeks, L = estimated 6+ weeks

Note: These milestones are scoped towards scaling up clients on a single load generation host (or node) within OSB. For milestones for scaling out OSB clients (or using 2+ load generation hosts or nodes), we'll need to develop an RFC for DWG as well as a separate list of milestones.

Milestone 1: Quantifying current limitations (M) The OSB community is aware that there are limitations in scaling clients within OSB, but is unsure what exactly those limitations are. Most of the time, users install OSB on a single load generation host and specify a number of clients, which provisions a certain number of threads. To test how well this works, a performance comparison between a cluster of nodes, each running OSB with a single client, and a single node running OSB with several clients will help uncover what those limitations are.

This will require setting up a testing apparatus. A few scripts can be created to expedite the performance testing and comparison process. The results should tell us whether OSB is accurately emulating metrics and provide insight into which OSB components are causing these limitations.
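As a rough illustration of what one of those comparison scripts might look like, the sketch below compares summary metrics from the two setups (many single-client hosts versus one multi-client host). The function names, the metric names, and the 5% tolerance are all assumptions made for illustration; none of them come from OSB itself, and the numbers in the usage note are made up.

```python
from typing import Dict


def relative_difference(baseline: float, candidate: float) -> float:
    """Signed difference of candidate relative to baseline, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def compare_runs(
    distributed: Dict[str, float],
    single_host: Dict[str, float],
    tolerance: float = 0.05,  # hypothetical threshold, not an OSB default
) -> Dict[str, Dict[str, object]]:
    """Compare metrics reported by both runs.

    `distributed` holds aggregated metrics from the cluster of single-client
    hosts (treated as the baseline); `single_host` holds the metrics from one
    host running several clients. Only metrics present in both are compared.
    """
    report = {}
    for name in sorted(distributed.keys() & single_host.keys()):
        diff = relative_difference(distributed[name], single_host[name])
        report[name] = {
            "relative_diff": diff,
            "within_tolerance": abs(diff) <= tolerance,
        }
    return report
```

For example, `compare_runs({"throughput": 100.0}, {"throughput": 90.0})` flags throughput as outside the tolerance, since the single-host run is 10% below the baseline.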

Milestone 2: Identify the workarounds (S) After understanding the limitations, we will determine whether there are any quick workarounds that users can apply to alleviate scaling limitations while work progresses on long-term solutions.

Based on the limitations we have discovered, we should look to make quick changes to the way OSB determines the number of worker actors to use and how it divides the clients among its workers. Outside of changes to the codebase, we can publish a guide with some general rules of thumb to help users avoid issues.
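To make the idea of dividing clients among workers concrete, here is a minimal sketch of an even, round-robin split. It is not OSB's actual allocation logic; `assign_clients` and its parameters are hypothetical names used only to show the kind of rule that could be tuned.

```python
from typing import List


def assign_clients(num_clients: int, num_workers: int) -> List[List[int]]:
    """Distribute client ids 0..num_clients-1 across workers round-robin.

    Worker sizes differ by at most one client, so no single worker carries
    a disproportionate share of the load.
    """
    if num_clients < 0:
        raise ValueError("num_clients must be non-negative")
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    assignments: List[List[int]] = [[] for _ in range(num_workers)]
    for client_id in range(num_clients):
        assignments[client_id % num_workers].append(client_id)
    return assignments
```

With 5 clients and 2 workers this yields `[[0, 2, 4], [1, 3]]`. A rule of thumb in a user guide could then be phrased in terms of the largest list length, i.e. the maximum number of clients any one worker has to drive.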

Milestone 3: Investigate bottlenecks (or causes of limitations) and overcome bottlenecks (M) For the limitations discovered in milestone 1, we will need to investigate the bottlenecks in more depth and identify causes. Subsequently, we should identify and implement appropriate solutions on how to resolve such bottlenecks and remove limitations found in OSB.

Since OSB might already have workarounds incorporated by this point, we can spend effort investigating the bottlenecks. This will involve looking at specific components within OSB, such as the worker coordinator actor and the worker actors. By analyzing the actor system, we should be able to come up with appropriate solutions and potential redesigns to resolve bottlenecks.

Milestone 4: Review (S) After all the work has been done, we should summarize our findings and solutions and ensure that OSB has been appropriately updated to handle scaling better.

From what we've discovered and implemented, we should draft subsequent action items (e.g. should there be any future enhancements or redesigns?). Additionally, work can begin on investigating DWG, which allows scaling out beyond a single load generation host.

For more information on each milestone, see the task issues / child issues in the following section:

Child Issues

  • [x] Test Plan Development https://github.com/opensearch-project/opensearch-benchmark/issues/536
  • [x] Scaling Investigation 1: Validate Client Simulation Accuracy https://github.com/opensearch-project/opensearch-benchmark/issues/557
  • [x] Scaling Investigation 2: Stress Load Generation Host and Discover Max Clients Per Worker https://github.com/opensearch-project/opensearch-benchmark/issues/558
  • [ ] Scaling Investigation 3: Stakeholder Investigation, Profiling LG Host, and Understanding Target Throughput Pathway

META Issue containing issues related to scaling in OSB:

https://github.com/opensearch-project/opensearch-benchmark/issues/593

IanHoang, Apr 04 '24 20:04

META issue containing issues related to scaling clients in OSB: https://github.com/opensearch-project/opensearch-benchmark/issues/593

IanHoang, Jul 25 '24 17:07

For milestones for scaling out OSB clients, we'll need to develop an RFC for DWG as well as a separate list of milestones.

Can we expand on scaling out? Are we saying multiple clients (distributed)?

Also not sure how Milestone 1 and Milestone 3 are different?

getsaurabh02, Jul 26 '24 16:07

After discussing with @gkamat and @getsaurabh02 last week, I will perform a preliminary scaling investigation (see child issue Scaling Investigation 1) to get more data points for us to work with in the RFC and this META task.

IanHoang, Jul 30 '24 18:07