
Improve performance in sync client

navneet1v opened this issue 1 year ago • 34 comments

What is the bug?

I am using the opensearch-py client version 2.2.0 to run searches against a cluster. I started using multi-threading to push more requests to the cluster, but after 4 threads we are not able to increase the throughput. I see no throttling happening on the server side, but the client is not able to push more requests.

How can one reproduce the bug?

Create an OpenSearch cluster and use the opensearch-py client with multi-threading to run queries. We tested with 1, 2, 4, 8, 16, 32, 48, and 96 threads.

What is the expected behavior?

The expectation is that the OpenSearch client should be able to push through more and more requests, and if the OpenSearch cluster is not able to take the load we should get a throttling exception.

What is your host/environment?

Linux.

Do you have any screenshots?

NA

Do you have any additional context?

I tried increasing the worker pool size from 10 to 20, but that didn't help.

navneet1v avatar Jul 20 '23 23:07 navneet1v

Hey, what are your cluster's index settings? I would expect this with default cluster settings, as shards in an index are thread bound as well. A default index will have 3 shards, and that's the max number of threads that can be used for ingestion.
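One way to verify this is to look at the index settings directly. A minimal sketch, assuming an index named my-index and a local cluster (both are placeholders, not from this issue):

```python
# Hypothetical check of how many primary shards an index has; the host and
# index name are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

settings = client.indices.get_settings(index="my-index")
print(settings["my-index"]["settings"]["index"]["number_of_shards"])
```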

dtaivpp avatar Jul 21 '23 00:07 dtaivpp

Are you spawning threads and creating instances of the client? Can you please post a repro? Is your actual usage of the client sync or async?

I do think that a client should be able to push the server into a state where it starts returning 429s.

dblock avatar Jul 21 '23 17:07 dblock

@dblock It's the same client being used across all the threads.

@dtaivpp, I checked the cluster and CPU utilization was not going above 60% even at the maximum thread count. That is why I think the problem is that the client is not able to push the requests.

I also tried changing pool_maxsize from 10 to 20; even then there is no increase in the number of requests that can be sent to the cluster.

navneet1v avatar Jul 21 '23 19:07 navneet1v

Are you using a sync client or an async client? My guess is that you're using a sync client.

I wrote a benchmark that compares synchronous, threaded and asynchronous I/O in opensearch-py. I was not able to push either the synchronous or the threaded version of the client past a certain limit, no matter the pool_maxsize or other settings I tried. The client does use a connection pool, but it's ~~probably just too busy doing other work on a single thread blocked by the GIL, so spawning more threads doesn't help~~.
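For context, the async variant of such a comparison has roughly the following shape. This is only a sketch under assumptions (placeholder host, index name, and document count; it is not the actual benchmark code):

```python
# Sketch of concurrent async inserts; assumes opensearch-py with the async
# extra (aiohttp) installed and a local cluster. Names are placeholders.
import asyncio
from opensearchpy import AsyncOpenSearch

async def insert_all(docs):
    client = AsyncOpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    try:
        # Issue all index requests concurrently; aiohttp multiplexes the I/O.
        await asyncio.gather(
            *(client.index(index="bench", body=doc) for doc in docs)
        )
    finally:
        await client.close()

asyncio.run(insert_all([{"n": i} for i in range(50)]))
```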

Sync vs. threaded inserts make no difference. For 5x50 inserts, async outperforms sync by 1.4x.

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃          Benchmark ┃ Min     ┃ Max     ┃ Mean    ┃ Min (+)         ┃ Max (+)         ┃ Mean (+)        ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. threaded. │ 1.387   │ 1.604   │ 1.450   │ 1.377 (1.0x)    │ 1.424 (1.1x)    │ 1.399 (1.0x)    │
│    Sync vs. async. │ 1.312   │ 1.479   │ 1.400   │ 0.960 (1.4x)    │ 1.047 (1.4x)    │ 0.994 (1.4x)    │
└────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

5x250 items, 2.5x improvement

┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. async. │ 4.426   │ 4.994   │ 4.694   │ 1.798 (2.5x)    │ 1.992 (2.5x)    │ 1.894 (2.5x)    │
└─────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

5x1000 items, 3.5x improvement

┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ Sync vs. async. │ 16.629  │ 17.493  │ 16.999  │ 4.863 (3.4x)    │ 5.025 (3.5x)    │ 4.980 (3.4x)    │
└─────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

We may be able to improve the synchronous client performance on insert, but I don't (yet) know how.

dblock avatar Jul 24 '23 22:07 dblock

I have a (not very scientific) flame chart I recorded of a synchronous client doing some search requests with SigV4. Obviously these numbers are very dependent on the specific request etc., but this was done on large machines, so hardware shouldn't be a bottleneck.

(flame chart image: profile_q)

Xtansia avatar Jul 24 '23 22:07 Xtansia

Here's the sync/async benchmark profile; it's actually stuck doing synchronous I/O.

(profiler screenshots)

dblock avatar Jul 24 '23 22:07 dblock

Here's the sync/async benchmark profile; it's actually stuck doing synchronous I/O.


This is strange; why is it stuck in synchronous I/O? My understanding was that I/O should be done in an async fashion, even when it is a sync client.

navneet1v avatar Jul 24 '23 22:07 navneet1v

This is strange; why is it stuck in synchronous I/O? My understanding was that I/O should be done in an async fashion, even when it is a sync client.

Unfortunately not, and it's by design. The default Urllib3HttpConnection uses urllib3, and the requests are synchronous. This is why aiohttp support was introduced, but it requires your code to become async to take advantage of it.
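A minimal sketch of the two styles side by side, assuming opensearch-py 2.x with the async extra installed (host and index name are placeholders):

```python
# Synchronous client: every call blocks the calling thread until the HTTP
# response comes back over urllib3 (the default connection class).
from opensearchpy import OpenSearch, Urllib3HttpConnection

sync_client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    connection_class=Urllib3HttpConnection,  # the default
)
sync_client.search(index="my-index", body={"query": {"match_all": {}}})

# Asynchronous client: backed by aiohttp, but the calling code must be async.
import asyncio
from opensearchpy import AsyncOpenSearch

async def main():
    async_client = AsyncOpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    try:
        await async_client.search(index="my-index", body={"query": {"match_all": {}}})
    finally:
        await async_client.close()

asyncio.run(main())
```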

dblock avatar Jul 24 '23 22:07 dblock

@dblock, can we use some other implementation of HttpConnection? Generally we use RequestsHttpConnection; does that have the same problem?

navneet1v avatar Jul 24 '23 23:07 navneet1v

@dblock, here is my thinking: as a user of the OpenSearch client, I feel that the sync client is easy to use, and most services offer the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy to use.

navneet1v avatar Jul 24 '23 23:07 navneet1v

@dblock, can we use some other implementation of HttpConnection? Generally we use RequestsHttpConnection; does that have the same problem?

Yes it does. That one does allow changing pool_maxsize, but that has little effect on throughput.
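For illustration, a sketch of raising the pool size (placeholder host; the thread mentions a default of 10, and pool_maxsize support for both connection classes is covered in the PR referenced later in this thread):

```python
# Hypothetical configuration; host is a placeholder.
from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    connection_class=RequestsHttpConnection,
    pool_maxsize=20,  # size of the underlying HTTP connection pool
)
```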

dblock avatar Jul 24 '23 23:07 dblock

@dblock, here is my thinking: as a user of the OpenSearch client, I feel that the sync client is easy to use, and most services offer the same thing. But if we cannot use threading with a simple sync client, that's a bummer, because the async client is not easy to use.

What are examples of other open-source clients that talk HTTP and don't exhibit this kind of problem?

PS: I did find https://number1.co.za/how-to-speed-up-http-calls-in-python-with-examples/, which claims that requests-futures can get close to async performance. That could be an idea for a reimplementation, but it ain't a small effort.
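To illustrate the requests-futures idea (this is plain HTTP against a placeholder endpoint, not opensearch-py code):

```python
# Sketch of requests-futures: requests are dispatched on a thread pool and
# returned as futures. URL and worker count are placeholders.
from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=8)
futures = [
    session.get("http://localhost:9200/my-index/_search")
    for _ in range(100)
]
for future in as_completed(futures):
    response = future.result()  # a regular requests.Response
    response.raise_for_status()
```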

dblock avatar Jul 24 '23 23:07 dblock

@dblock, can we check Milvus?

navneet1v avatar Jul 24 '23 23:07 navneet1v

@dblock, can we check Milvus?

Seems to be using gRPC, a whole other ball game.

dblock avatar Jul 24 '23 23:07 dblock

Let me check if I can find any other clients that are not gRPC-based. This seems like a standard thing, and a fix should exist.

navneet1v avatar Jul 24 '23 23:07 navneet1v

Can you run multiple Python processes with multiple clients? There is also bulk, which helps max out the connection.

wbeckler avatar Jul 25 '23 04:07 wbeckler

@wbeckler No, we don't want to run multiple clients, as that defeats the purpose of what we are trying to benchmark.

Also, the requests we are sending are queries, not indexing, so bulk is not an option.

navneet1v avatar Jul 26 '23 00:07 navneet1v

Thanks @navneet1v. I do think the problem is clear: we want the client to not block on I/O across requests.

dblock avatar Jul 26 '23 14:07 dblock

Thanks @navneet1v. I do think the problem is clear: we want the client to not block on I/O across requests.

Yes that is correct.

navneet1v avatar Jul 26 '23 16:07 navneet1v

@dblock @wbeckler is anything required from my side to prioritize this issue?

navneet1v avatar Jul 26 '23 18:07 navneet1v

Is it okay to use python multiprocessing instead of multithreading in your use case? https://docs.python.org/3/library/multiprocessing.html

wbeckler avatar Jul 26 '23 22:07 wbeckler

@wbeckler we could use that, but the sync client is easy to use and is already integrated in various ANN benchmarks, which is why we want to keep things simple.

navneet1v avatar Jul 27 '23 19:07 navneet1v

I'm suggesting that the benchmarking or high-volume query application rely on multiprocessing; that way the Python client would not be blocking on I/O. So instead of multithread: {do python client stuff}, you would have multiprocess: {do python client stuff}.

I am not a python expert, so I might be way off here, but that's how it seems it should work from my naive perspective.
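A minimal sketch of that shape, assuming each worker process builds its own client (host, index, and batch sizes are placeholders):

```python
# Hypothetical multiprocessing version: one client per process, so no
# connection or GIL state is shared across processes.
from multiprocessing import Pool
from opensearchpy import OpenSearch

def run_queries(queries):
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    return [client.search(index="my-index", body=q) for q in queries]

if __name__ == "__main__":
    batches = [[{"query": {"match_all": {}}}] * 25 for _ in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(run_queries, batches)
```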

wbeckler avatar Jul 30 '23 03:07 wbeckler

I am going to take another look at this, assigning to myself.

dblock avatar Oct 03 '23 21:10 dblock

Updated benchmarks in https://github.com/dblock/opensearch-py/tree/benchmarks/samples/benchmarks.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 client vs. more clients (async) | 1.076 | 1.076 | 1.076 | 0.787 (1.4x) | 0.787 (1.4x) | 0.787 (1.4x) |
| 1 thread vs. 2 threads (sync) | 6.013 | 6.013 | 6.013 | 3.629 (1.7x) | 3.629 (1.7x) | 3.629 (1.7x) |
| 1 thread vs. 8 threads (sync) | 6.031 | 6.031 | 6.031 | 1.186 (5.1x) | 1.186 (5.1x) | 1.186 (5.1x) |
| 1 thread vs. 32 threads (sync) | 5.777 | 5.777 | 5.777 | 1.058 (5.5x) | 1.058 (5.5x) | 1.058 (5.5x) |

We are getting an improvement by using threads (1.7x with 2, 5.1x with 8, and, with diminishing returns beyond that, 5.5x with 32).

If we set pool_maxsize to the number of threads, this improves further, up to a point.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 client vs. more clients (async) | 1.401 | 1.401 | 1.401 | 0.788 (1.8x) | 0.788 (1.8x) | 0.788 (1.8x) |
| 1 thread vs. 2 threads (sync) | 7.355 | 7.355 | 7.355 | 2.835 (2.6x) | 2.835 (2.6x) | 2.835 (2.6x) |
| 1 thread vs. 4 threads (sync) | 6.616 | 6.616 | 6.616 | 1.777 (3.7x) | 1.777 (3.7x) | 1.777 (3.7x) |
| 1 thread vs. 8 threads (sync) | 6.340 | 6.340 | 6.340 | 1.272 (5.0x) | 1.272 (5.0x) | 1.272 (5.0x) |
| 1 thread vs. 32 threads (sync) | 6.608 | 6.608 | 6.608 | 1.076 (6.1x) | 1.076 (6.1x) | 1.076 (6.1x) |

Sync vs. async. Async really just depends on the pool size (3 iterations).

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
| --- | --- | --- | --- | --- | --- | --- |
| async (1 client) vs. sync (8 threads) | 3.560 | 3.560 | 3.560 | 3.210 (1.1x) | 3.210 (1.1x) | 3.210 (1.1x) |
| async (8 clients) vs. sync (8 threads) | 3.526 | 3.526 | 3.526 | 2.342 (1.5x) | 2.342 (1.5x) | 2.342 (1.5x) |

dblock avatar Oct 07 '23 15:10 dblock

I've spent a lot of time on the synchronous client, having written multiple benchmarks and dug through the code to understand how pooling works. I've varied the number of threads, the number of clients, search vs. ingestion, and data sizes, and switched between the default Urllib3HttpConnection and the alternative RequestsHttpConnection connection classes.

The simplest benchmark calls info() in a loop 100 times. With more threads and the max pool size set to the number of threads, the client improves throughput about 5x and 3x with the two connection classes.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 thread vs. 32 threads (sync), RequestsHttpConnection | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
| 1 thread vs. 32 threads (sync), Urllib3HttpConnection | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |

This does show that RequestsHttpConnection is a lot slower. @navneet1v mentions we should remove it from being the default recommended in the documentation, which seems right.

Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.
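For reference, that simplest benchmark has roughly this shape; a sketch only (placeholder host; the actual benchmarks live in the samples/benchmarks folder linked earlier in the thread):

```python
# Sketch: 100 info() calls fanned out over a thread pool, with the connection
# pool sized to match the thread count. Host and counts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from opensearchpy import OpenSearch, Urllib3HttpConnection

THREADS = 32
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    connection_class=Urllib3HttpConnection,
    pool_maxsize=THREADS,  # match the pool to the thread count
)

def call_info(_):
    client.info()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=THREADS) as executor:
    list(executor.map(call_info, range(100)))
print(f"{time.perf_counter() - start:.3f}s for 100 info() calls")
```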

dblock avatar Oct 12 '23 18:10 dblock

@dblock Thanks for doing the deep dive. Most of the time, as users, we go by the examples. If changing the connection class is the only problem, we should just fix the examples and our documentation.

Also, can we mark RequestsHttpConnection as deprecated, and document that using this connection class can degrade performance?

navneet1v avatar Oct 12 '23 18:10 navneet1v

In https://github.com/opensearch-project/opensearch-py/pull/535 I documented the connection pools and made sure pool_maxsize works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate RequestsHttpConnection, even if it's slower. If an application already depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than Urllib3HttpConnection, but we aren't deprecating the synchronous implementation.

dblock avatar Oct 12 '23 20:10 dblock

In #535 I documented the connection pools and made sure pool_maxsize works for both implementations (it doesn't improve benchmarks by much).

I don't think we should deprecate RequestsHttpConnection, even if it's slower. If an application already depends on requests, you will not want to take a dependency on another HTTP library. Similarly, using async in this client is much faster than Urllib3HttpConnection, but we aren't deprecating the synchronous implementation.

Makes sense. My whole point was that we should give a recommendation for best performance and also document the challenges with the other connection classes; RequestsHttpConnection doesn't perform well, per our benchmarks.

navneet1v avatar Oct 12 '23 21:10 navneet1v

I've spent a lot of time on the synchronous client, having written multiple benchmarks and dug through the code to understand how pooling works. I've varied the number of threads, the number of clients, search vs. ingestion, and data sizes, and switched between the default Urllib3HttpConnection and the alternative RequestsHttpConnection connection classes.

The simplest benchmark calls info() in a loop 100 times. With more threads and the max pool size set to the number of threads, the client improves throughput about 5x and 3x with the two connection classes.

| Benchmark | Min | Max | Mean | Min (+) | Max (+) | Mean (+) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 thread vs. 32 threads (sync), RequestsHttpConnection | 30.104 | 30.104 | 30.104 | 5.697 (5.3x) | 5.697 (5.3x) | 5.697 (5.3x) |
| 1 thread vs. 32 threads (sync), Urllib3HttpConnection | 9.646 | 9.646 | 9.646 | 2.748 (3.5x) | 2.748 (3.5x) | 2.748 (3.5x) |

This does show that RequestsHttpConnection is a lot slower. @navneet1v mentions we should remove it from being the default recommended in the documentation, which seems right.

Overall, these benchmarks are pretty conclusive that the client does not have the limitation as described and we need to dig deeper into how it's being used to see where the bottleneck is.

It's great to see benchmarks for this client! @dblock Is there any way we can make these benchmarks part of this repo or part of opensearch-benchmark?

VachaShah avatar Oct 12 '23 21:10 VachaShah