[META] Leverage protobuf for serializing select node-to-node objects
Please describe the end goal of this project
Provide a framework and migrate some key node-to-node requests to serialize via protobuf rather than the current native stream writing (i.e. `writeTo(StreamOutput out)`). Starting with node-to-node transport messages maintains API compatibility while introducing protobuf implementations, which have several benefits:
- Out of the box backwards compatibility/versioning
- Speculative performance enhancements in some areas
- Supports future gRPC work
Supporting References
- Search API POC - flame graph + OSB numbers
- Issue for API leveraging of gRPC + protobuf
- gRPC-based Search API
Issues
- [ ] Feature flag to enable protobuf transport protocol
- [ ] Support handlers for non-native transport protocols
- [ ] Initial protobuf implementation for types - FetchSearchResult, SearchHits, SearchHit
Related component
Search:Performance
Having spent way too much time diving into the byte-level (or bit-level if you consider 7-bit Vints) details of these protocols, I want to make sure we're focusing on the correct things here.
Protobuf isn't some magical thing that brings a 30% improvement in the way the POC implemented it, and I don't think the existing POCs fully tested it against the existing capabilities of the transport protocol when used properly.
In summary:
- I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information
- Existing StreamInput/StreamOutput already contains the capability to use variable-length ints/longs, they're just underused
- The existing Transport Protocol using Vint/Vlong is marginally (though likely not significantly) more efficient, in bandwidth terms, than protobuf when dealing with unsigned values, and Zint/Zlong is equivalent to protobuf signed values.
- Broadly speaking, protobuf can be marginally more efficient in the case where there are a lot more "null" or "empty" values (you gain a byte in each case of a missing value, but lose a byte for every present value).
- Protobuf needs to identify every field, which takes a byte; StreamInput/StreamOutput needs a boolean to implement optional fields.
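To make the byte-level comparison above concrete, here is a small illustrative sketch (plain Python, not OpenSearch or protobuf library code) of base-128 varint and ZigZag encoding — the scheme protobuf uses, which VInt/ZInt mirror:

```python
# Illustrative sketch only: protobuf-style base-128 varints and ZigZag
# encoding, to show why unsigned VInt/VLong and protobuf varints cost the
# same number of bytes, and why ZigZag matters for negative values.

def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # set continuation bit, more to come
        else:
            out.append(byte)
            return bytes(out)

def zigzag(n: int, bits: int = 64) -> int:
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) ^ (n >> (bits - 1))

# Small unsigned values fit in one byte; 300 needs two.
assert len(encode_varint(1)) == 1
assert len(encode_varint(300)) == 2
# Without ZigZag, -1 as a 64-bit two's-complement varint costs 10 bytes;
# ZigZag maps it to 1, which costs a single byte.
assert zigzag(-1) == 1
assert len(encode_varint(zigzag(-1))) == 1
```

Since both protocols use the same 7-bit grouping for unsigned values, any bandwidth difference comes from field tagging and optional handling, not from the integer encoding itself.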
Protobuf's primary benefit is in backwards compatibility.
There are probably additional significant benefits of gRPC but this meta issue doesn't seem to be about changing the entire protocol, so I don't think those should be considered as part of the benefits proposed.
The performance impacts are very situation specific and shouldn't be assumed; it's possible a change in streaming to variable-length integers/longs (VInt/ZInt) can achieve similar gains.
Hi @dbwiddis, thanks for the feedback!
- I think most of the performance gain in the POC was based on reducing the number of bytes used to transmit information
I am noticing protobuf objects I implement write more bytes to stream than the native "writeTo" serialization. Small protobuf objects are about ~20% larger, but the difference shrinks as request content grows. So far I've been focusing on FetchSearchResult and its members. If I'm understanding correctly, the best use case for protobuf (considering only performance) would be something like a stats endpoint, which contains a lot of ints/longs, particularly those which are not UUIDs/hashes and could be squished into a single-byte representation.
As I learn more about protobuf I'm understanding that any potential performance improvement just from a change to serialization is going to be speculative, and I'll update this issue with benchmarks as I run them. Currently the concrete expected benefits are out-of-the-box backwards compatibility as well as providing a stepping stone for gRPC support.
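As a rough illustration of where the "out of the box" backwards compatibility comes from, the sketch below (hand-rolled Python with invented field numbers, not the protobuf library) shows how a protobuf-style decoder can skip tagged fields it does not recognize — which is what lets an old reader tolerate messages from a newer writer:

```python
# Illustrative sketch: a decoder needs only the wire type in each tag to
# skip fields it does not know, so adding fields does not break old readers.

def read_varint(buf: bytes, i: int):
    """Read one base-128 varint starting at offset i; return (value, next_i)."""
    n = shift = 0
    while True:
        b = buf[i]; i += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i
        shift += 7

def decode(buf: bytes, known: set):
    """Return {field_number: value} for known varint fields, skipping others."""
    fields, i = {}, 0
    while i < len(buf):
        tag, i = read_varint(buf, i)
        field_no, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:            # varint payload
            val, i = read_varint(buf, i)
            if field_no in known:
                fields[field_no] = val
        elif wire_type == 2:          # length-delimited: skip payload bytes
            length, i = read_varint(buf, i)
            i += length
        else:
            raise ValueError("wire type not handled in this sketch")
    return fields

# Message with known field 1 = 7 plus an unknown field 3 = 300.
msg = bytes([0x08, 0x07, 0x18, 0xAC, 0x02])
assert decode(msg, known={1}) == {1: 7}  # unknown field skipped cleanly
```

The native protocol, by contrast, reads fields positionally, so any schema change must be gated on a version check in both reader and writer.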
Initial numbers for protobuf vs native serializers. Run with 5 data nodes (r5.xlarge) per cluster against OS 2.16. So far there is no apparent speedup and protobuf is as performant as the native protocol in this case. Vector workload in progress.
OSB Big 5
2.16 min distribution
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | desc_sort_timestamp | 76.5407 | ms |
| 90th percentile latency | term | 69.2465 | ms |
| 90th percentile latency | multi_terms-keyword | 71.7096 | ms |
| 90th percentile latency | scroll | 503024 | ms |
| 90th percentile latency | query-string-on-message | 76.0508 | ms |
FetchSearchResults as protobuf - Branch
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | desc_sort_timestamp | 78.7184 | ms |
| 90th percentile latency | term | 69.6506 | ms |
| 90th percentile latency | multi_terms-keyword | 69.0668 | ms |
| 90th percentile latency | scroll | 560605 | ms |
| 90th percentile latency | query-string-on-message | 74.4521 | ms |
Notes
- Several FetchSearchResults fields could be further decomposed into protobuf. For example, `SearchHits.collapseFields` is still encoded with its native `writeTo` implementation.
- Initial POC work (diff viewable here) created protobuf-exclusive copies of structures within the search path. This may have benefited performance by allowing inlining or reducing vtable lookups when handling transport layer responses. An interesting read on inlining potentially relevant to our use of `writeTo` in our native serialization.
- Search results payload may not be large enough to see any performance difference in the above workloads. Workloads with larger search results may show a more substantial performance difference.
- As noted above, a key area where we expect benefit from protobuf is automatic variable-length encoding for numeric values. `FetchSearchResult` is not particularly suited to this case, and the same functionality exists in the native protocol (although underused in some cases).
- This implementation deserializes the message immediately. With a slightly different implementation, some message types could benefit from protobuf's lazy loading functionality.
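The lazy-loading point can be sketched as follows (plain Python, not the real protobuf lazy-parsing API): hold on to the wire bytes and only parse when a field is first accessed, so a result that is merely forwarded to another node never pays the parsing cost. `parse` here is a hypothetical stand-in for an actual protobuf parse call.

```python
# Minimal sketch of lazy deserialization: keep raw bytes, parse on demand.

class LazyResult:
    def __init__(self, raw: bytes, parse):
        self._raw = raw
        self._parse = parse
        self._msg = None

    @property
    def message(self):
        if self._msg is None:           # parse on first access only
            self._msg = self._parse(self._raw)
        return self._msg

    def forward(self) -> bytes:
        return self._raw                # pass through without parsing

parse_calls = []
def parse(raw: bytes) -> str:
    parse_calls.append(raw)
    return raw.decode()

r = LazyResult(b"hits", parse)
assert r.forward() == b"hits" and parse_calls == []   # forwarding never parses
assert r.message == "hits"                            # parsed lazily...
assert r.message == "hits" and len(parse_calls) == 1  # ...exactly once
```

A coordinator that only aggregates or relays shard responses could skip deserializing fields it never touches.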
Let's keep in mind that while we may not see as much benefit node-to-node, switching client-to-server from JSON to protobuf should. There's also a significant improvement in developer experience if we can replace the native implementation with protobuf.
Some additional benchmarks for a vector search workload. Again no noticeable difference in performance.
5 data nodes (r5.xlarge) per cluster. 10000 queries against sift-128-euclidean.hdf5 data set.
FetchSearchResults as Protobuf
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | prod-queries | 1.82 | ops/s |
| Mean Throughput | prod-queries | 13.69 | ops/s |
| Median Throughput | prod-queries | 13.84 | ops/s |
| Max Throughput | prod-queries | 13.93 | ops/s |
| 50th percentile latency | prod-queries | 70.1863 | ms |
| 90th percentile latency | prod-queries | 71.6104 | ms |
| 99th percentile latency | prod-queries | 73.9335 | ms |
| 99.9th percentile latency | prod-queries | 79.8549 | ms |
| 99.99th percentile latency | prod-queries | 273.26 | ms |
| 100th percentile latency | prod-queries | 546.72 | ms |
| 50th percentile service time | prod-queries | 70.1863 | ms |
| 90th percentile service time | prod-queries | 71.6104 | ms |
| 99th percentile service time | prod-queries | 73.9335 | ms |
| 99.9th percentile service time | prod-queries | 79.8549 | ms |
| 99.99th percentile service time | prod-queries | 273.26 | ms |
| 100th percentile service time | prod-queries | 546.72 | ms |
| error rate | prod-queries | 0 | % |
| Mean recall@k | prod-queries | 0.97 | |
| Mean recall@1 | prod-queries | 0.99 | |
2.16 Native serialization
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | prod-queries | 1.63 | ops/s |
| Mean Throughput | prod-queries | 13.49 | ops/s |
| Median Throughput | prod-queries | 13.73 | ops/s |
| Max Throughput | prod-queries | 13.84 | ops/s |
| 50th percentile latency | prod-queries | 70.525 | ms |
| 90th percentile latency | prod-queries | 71.8059 | ms |
| 99th percentile latency | prod-queries | 77.1621 | ms |
| 99.9th percentile latency | prod-queries | 83.521 | ms |
| 99.99th percentile latency | prod-queries | 446.145 | ms |
| 100th percentile latency | prod-queries | 610.186 | ms |
| 50th percentile service time | prod-queries | 70.525 | ms |
| 90th percentile service time | prod-queries | 71.8059 | ms |
| 99th percentile service time | prod-queries | 77.1621 | ms |
| 99.9th percentile service time | prod-queries | 83.521 | ms |
| 99.99th percentile service time | prod-queries | 446.145 | ms |
| 100th percentile service time | prod-queries | 610.186 | ms |
| error rate | prod-queries | 0 | % |
| Mean recall@k | prod-queries | 0.97 | |
| Mean recall@1 | prod-queries | 0.99 | |
@finnegancarroll can we run vector search workload with client-to-server switching? Thanks.
Experimenting with a previous POC with diff here: https://github.com/finnegancarroll/OpenSearch/pull/2/files
Benchmarking locally with 5 nodes in separate Docker containers against a 20% big5 workload.
POC diff:
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Mean Throughput | default | 2.01 | ops/s |
| Median Throughput | default | 2.01 | ops/s |
| Max Throughput | default | 2.01 | ops/s |
| 50th percentile latency | default | 5.2044 | ms |
| 90th percentile latency | default | 5.78371 | ms |
| 99th percentile latency | default | 6.11664 | ms |
| 100th percentile latency | default | 6.24282 | ms |
| 50th percentile service time | default | 3.73635 | ms |
| 90th percentile service time | default | 4.09466 | ms |
| 99th percentile service time | default | 4.55763 | ms |
| 100th percentile service time | default | 4.74975 | ms |
| error rate | default | 0 | % |
| Min Throughput | range | 2.01 | ops/s |
| Mean Throughput | range | 2.01 | ops/s |
| Median Throughput | range | 2.01 | ops/s |
| Max Throughput | range | 2.01 | ops/s |
| 50th percentile latency | range | 33.3813 | ms |
| 90th percentile latency | range | 35.1394 | ms |
| 99th percentile latency | range | 38.2579 | ms |
| 100th percentile latency | range | 39.8031 | ms |
| 50th percentile service time | range | 31.8209 | ms |
| 90th percentile service time | range | 33.8545 | ms |
| 99th percentile service time | range | 36.5908 | ms |
| 100th percentile service time | range | 38.0003 | ms |
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | default | 2 | ops/s |
| Mean Throughput | default | 2 | ops/s |
| Median Throughput | default | 2 | ops/s |
| Max Throughput | default | 2 | ops/s |
| 50th percentile latency | default | 7.04535 | ms |
| 90th percentile latency | default | 7.87684 | ms |
| 99th percentile latency | default | 11.5237 | ms |
| 100th percentile latency | default | 17.7137 | ms |
| 50th percentile service time | default | 5.49358 | ms |
| 90th percentile service time | default | 6.13559 | ms |
| 99th percentile service time | default | 10.3286 | ms |
| 100th percentile service time | default | 16.1667 | ms |
| error rate | default | 0 | % |
| Min Throughput | range | 2 | ops/s |
| Mean Throughput | range | 2 | ops/s |
| Median Throughput | range | 2 | ops/s |
| Max Throughput | range | 2.01 | ops/s |
| 50th percentile latency | range | 15.2952 | ms |
| 90th percentile latency | range | 15.9011 | ms |
| 99th percentile latency | range | 20.6263 | ms |
| 100th percentile latency | range | 24.5943 | ms |
| 50th percentile service time | range | 13.646 | ms |
| 90th percentile service time | range | 14.3388 | ms |
| 99th percentile service time | range | 19.2499 | ms |
| 100th percentile service time | range | 23.1727 | ms |
| error rate | range | 0 | % |
Protobuf increases our payload size by around 5% based on term queries run on the big5 dataset. On average, the following query run against the big5 dataset produces a ~9.5 MB response with the native serializer and ~10 MB with protobuf.
```shell
curl -X POST "http://localhost:9200/_search" \
  -H "Content-Type: application/json" \
  -d '{
  "size": 10000,
  "query": {
    "term": {
      "log.file.path": {
        "value": "/var/log/messages/birdknight"
      }
    }
  }
}'
```
Without trying to deconstruct the protobuf binary format, one explanation could be that protobuf loses some space efficiency by requiring each field in a protobuf object to have its position marked with a "field number". In contrast, the OpenSearch native node-to-node protocol simply expects to read fields in the exact predetermined order in which they are written.
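A back-of-the-envelope check of this explanation (not OpenSearch code; the field counts below are invented purely for illustration): every present protobuf field is prefixed with a tag varint encoding `(field_number << 3) | wire_type`, which costs one byte for field numbers 1-15.

```python
# Sketch: the protobuf tag written before each present field, and a rough
# estimate of the per-field overhead it implies. Field/payload counts are
# hypothetical, chosen only to show the overhead lands near the observed ~5%.

def tag(field_number: int, wire_type: int) -> int:
    """Compute the tag value protobuf writes before each present field."""
    return (field_number << 3) | wire_type

# Field numbers 1..15 keep the tag under 128, i.e. a single varint byte.
assert all(tag(n, wt) < 128 for n in range(1, 16) for wt in range(8))
assert tag(16, 0) >= 128  # field numbers >= 16 need a two-byte tag

# If a hit serializes, say, 20 present fields averaging 24 payload bytes,
# one tag byte per field adds 20 / (20 * 24), roughly 4%, on the wire.
overhead = 20 / (20 * 24)
assert 0.03 < overhead < 0.05
```

The overhead shrinks as individual field payloads grow, consistent with the gap narrowing on larger responses noted earlier in the thread.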
Some micro benchmarks with flame graphs comparing protobuf serialization to native transport & client/server JSON serialization. Test branch can be found here: https://github.com/finnegancarroll/OpenSearch/tree/serde-micro-bench
Note: Test data is generated with SearchHitsTests::createTestItem so the content is randomized and may not accurately reflect real use cases.
While microbenchmarks are not particularly reliable, this does seem to suggest protobuf is more performant in both cases — much more so client/server than at the transport layer. Serialization is likely not a large enough slice of the total latency to make an impact in OSB benchmarks.
```
protoWriteTo  - SearchHitsProtobufBenchmark.writeToBench  avgt 4  3.460 ms/op
nativeWriteTo - SearchHitsProtobufBenchmark.writeToBench  avgt 2  5.991 ms/op
protoXCont    - SearchHitsProtobufBenchmark.toXContBench  avgt 2 19.393 ms/op
nativeXCont   - SearchHitsProtobufBenchmark.toXContBench  avgt 2 46.680 ms/op
```
Thanks a lot, @finnegancarroll. I am a bit surprised (since "in general" a specialized implementation is expected to be a bit faster than a generalized one). I am eager to run the SearchHitsProtobufBenchmark out of curiosity; the writeToBench is commented out in your branch, does it correspond to SearchHitsProtobufBenchmark.writeToBench (if I uncomment it)? Thank you.
End to end benchmarks with protobuf on transport layer as well as the REST response for a vector workload. Environment is identical to the previous tests here (with 2.17 as baseline): https://github.com/opensearch-project/OpenSearch/issues/15308#issuecomment-2334816295
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | prod-queries | 1.48 | ops/s |
| Mean Throughput | prod-queries | 13.57 | ops/s |
| Median Throughput | prod-queries | 13.76 | ops/s |
| Max Throughput | prod-queries | 13.87 | ops/s |
| 50th percentile latency | prod-queries | 70.0255 | ms |
| 90th percentile latency | prod-queries | 70.999 | ms |
| 99th percentile latency | prod-queries | 73.3903 | ms |
| 99.9th percentile latency | prod-queries | 79.4551 | ms |
| 99.99th percentile latency | prod-queries | 457.676 | ms |
| 100th percentile latency | prod-queries | 673.036 | ms |
| 50th percentile service time | prod-queries | 70.0255 | ms |
| 90th percentile service time | prod-queries | 70.999 | ms |
| 99th percentile service time | prod-queries | 73.3903 | ms |
| 99.9th percentile service time | prod-queries | 79.4551 | ms |
| 99.99th percentile service time | prod-queries | 457.676 | ms |
| 100th percentile service time | prod-queries | 673.036 | ms |
The change in serialization does not appear to impact latencies in this case. I generated flame graphs using OS 2.17 on a local 3-node cluster for this workload to estimate how much serialization contributes to latency.
Examining the coordinator node for the cluster:
- `FetchSearchResults.<init>` - 1.27% of CPU w/ 61 samples
- `SearchHits.toXContent` - 1.27% w/ 61 samples

On average the REST response from the coordinator is a relatively small ~13 KB.
Hi @reta, I've cleaned up SearchHitsProtobufBenchmark to make it easier to reproduce. To provide the micro benchmarks with test items I've added a small unit test which generates and writes instances of `SearchHits` to a temporary directory. The whole micro benchmark test suite can be run with the following two commands:
```shell
./gradlew server:test --tests "org.opensearch.search.SearchHitsTests.testMicroBenchmarkHackGenerateTestFiles" -Dtests.security.manager=false
./gradlew -p benchmarks run --args 'SearchHitsProtobufBenchmark'
```
I agree it is surprising that SearchHitsProtobufBenchmark.writeToNativeBench is no faster than writing the protobuf object in this case, although they are very close. Messing with the test params and changing machines, I see similar results. With more test files and iterations:
```
SearchHitsProtobufBenchmark.toXContNativeBench avgt 26.244 ms/op
SearchHitsProtobufBenchmark.toXContProtoBench  avgt 13.193 ms/op
SearchHitsProtobufBenchmark.writeToNativeBench avgt 16.033 ms/op
SearchHitsProtobufBenchmark.writeToProtoBench  avgt 12.170 ms/op
```
Looking through the flame graphs of SearchHitsProtobufBenchmark.writeToProtoBench and SearchHitsProtobufBenchmark.writeToNativeBench, I'm wondering if the difference comes down to how `SearchHit.documentFields` and `SearchHit.metaFields` are handled. SearchHitsProtobufBenchmark.writeToNativeBench spends almost 40% of its time in `StreamOutput::writeGenericValue`, but I don't see the same kind of time spent writing to stream in the protobuf implementation.
Thanks @finnegancarroll
`SearchHitsProtobufBenchmark.writeToNativeBench` spends almost 40% of its time in `StreamOutput::writeGenericValue` but I don't see the same kind of time spent writing to stream in the protobuf implementation.
Very true, I came to the same conclusion. I will try to briefly look at what the issue could be there (without spending too much time). Thanks a lot for cleaning up the benchmarks!
While investigating areas we might expect to benefit from protobuf I spent some time reproducing this previous POC: https://github.com/opensearch-project/OpenSearch/issues/10684#issuecomment-1876077885
The benchmarks below were run locally with 5 nodes, gradle run, and the nyc_taxis workload, which is as close an environment as I can manage to the previous POC benchmarks. There is a noticeable difference in latency between these results and the original OSB numbers, which could be attributed to using more of the nyc_taxis dataset or to differences in hardware.
Edit: The previous POC gives Store size | 0.000252848 | GB while the below is using all 24GB, which likely accounts for the difference in latency.
Protobuf Proof of Concept Branch POC branch OSB-poc-search-protobuf.zip
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | range | 240.02 | ms |
| 90th percentile latency | range | 236.208 | ms |
| 90th percentile latency | range | 240.254 | ms |
| 90th percentile latency | range | 231.929 | ms |
| 90th percentile latency | default | 12.0153 | ms |
| 90th percentile latency | default | 10.2515 | ms |
| 90th percentile latency | default | 9.61354 | ms |
| 90th percentile latency | default | 9.62332 | ms |
Native Branch Head of main for the above POC branch (commit d916f9c1027f5b2ccff971a66921b7c26db1688f) OSB-d916f9c.zip
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| 90th percentile latency | range | 236.776 | ms |
| 90th percentile latency | range | 243.967 | ms |
| 90th percentile latency | default | 9.56527 | ms |
| 90th percentile latency | default | 7.67412 | ms |
Inspecting flame graphs for poc-search-protobuf, I see nearly all CPU cycles are consumed by Lucene processing the search and little time is spent on serialization. poc-search-protobuf5NodeClusterDefaultFlameGraphs.zip poc-search-protobuf5NodeClusterRangeFlameGraphs.zip
@finnegancarroll This is great seeing benchmarking and deep diving into the details here!
Without trying to deconstruct the protobuf binary format one explanation could be protobuf losses some space efficiency by requiring each field in a protobuf object have its position marked with a "field number". In contrast the OpenSearch native node-to-node protocol simply expects to read fields in the exact pre-determined order in which they are written.
Having deconstructed the protobuf binary format, I'll confirm this observation. :)
In short: protobuf uses a byte for each field that exists (type and index info) and 0 bytes for nonexistent fields. The binary transport protocol doesn't mark each field but assumes all fields are present in a specific order, and requires the use of a byte for a boolean "optional". So the takeaway from a "bandwidth" perspective is that if you tend to have a lot of optional information, you can save space with protobuf.
But as your tests seem to show, this marginal difference in bandwidth seems to be less significant, and maybe not the thing to focus on....
Thanks @finnegancarroll, my apologies for the delay, just completed my part. I didn't spend too much time on native serialization, but based on the benchmarks it looks like `StreamOutput::getGenericType` (which is used for metaFields and documentFields) is the bottleneck. I think we could use some clever tricks to make it faster for collections (one such example would be to capture the type of the first element in the collection, since I suspect those are homogeneous but have different types). I could spend more time on that if you think it is beneficial.
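The suggested trick can be sketched as follows (invented helper names, not the actual StreamOutput implementation): detect that a collection is homogeneous, write its element type tag once, and skip the per-element type lookup:

```python
# Sketch of the homogeneous-collection optimization: one leading type tag
# for the whole list instead of a tag per element. TYPE_TAGS is a stand-in
# for the generic-writer registry consulted by writeGenericValue.

TYPE_TAGS = {int: 0, str: 1}

def write_generic_list(items):
    """Current style: resolve and write a type tag for every element."""
    out = []
    for item in items:
        out += [TYPE_TAGS[type(item)], item]
    return out

def write_homogeneous_list(items):
    """Optimized style: one leading tag when all elements share a type."""
    first = TYPE_TAGS[type(items[0])]
    if all(TYPE_TAGS[type(i)] == first for i in items):
        return [first] + list(items)
    return write_generic_list(items)  # fall back for mixed collections

nums = list(range(100))
assert len(write_generic_list(nums)) == 200      # tag + value per element
assert len(write_homogeneous_list(nums)) == 101  # single leading tag
```

Besides the bytes saved, the real win in Java would be avoiding the repeated reflective type lookup per element, which is where the flame graphs show the time going.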
But as your tests seem to show, this marginal difference in bandwidth seems to be less significant, and maybe not the thing to focus on....
This is a very valid point @dbwiddis, but we also should not forget the maintenance costs. Changes in the native protocol are difficult now and incur a burden on everyone (update main, backport + change version, forwardport to main, ...). Protobuf takes care of that (as long as the schema is properly evolved); I think that would bring more benefits than marginal latency improvements.
Hi @reta, I spent some time exploring your above suggestion and I'm seeing benefits, but only in some limited artificial cases for the moment. `StreamOutput::writeGenericValue` looks to inline getGenericType and getWriter, but running benchmarks with the below shows about ~20% of our total time is spent in `StreamOutput::getGenericType`.
```shell
./gradlew -p benchmarks run --args ' SearchHitsProtobufBenchmark.writeToNativeBench -jvmArgs "-XX:CompileCommand=dontinline,org.opensearch.core.common.io.stream.StreamOutput::getGenericType -XX:CompileCommand=dontinline,org.opensearch.core.common.io.stream.StreamOutput::getWriter"'
```
(StreamOutput::getGenericType in purple)
writeGenericNoInline.zip
Stepping through with a debugger, it looks like we do save on some calls to `StreamOutput::getGenericType`, but with test items taken from the unit test suite and OSB datasets I'm having difficulty finding DocumentFields that are specifically large collections.
To quickly hack in a "best case" scenario dataset, I modified the unit tests such that DocumentFields could only be 100-element integer lists, with the optimized writing implemented like this.
This gives a pretty substantial speedup in microbenchmarks, especially considering the CPU now spends nearly all of its time writing DocumentFields specifically.
```
// Control
SearchHitsProtobufBenchmark.writeToNativeBench avgt 2 2205.636 ms/op
// Skip type checking for homogeneous collections
SearchHitsProtobufBenchmark.writeToNativeBench avgt 4 658.330 ± 255.170 ms/op
```
Control
Skip type checking for homogeneous collections
Thanks a lot @finnegancarroll, it looks like we are at a crossroads now: should we move forward with protobuf (since the latency wins are moderate), or (maybe) move forward with optimizing the native protocol?
Additional benchmarks comparing protobuf to native protocol with the same inflated DocumentFields size as used here.
```
SearchHitsProtobufBenchmark.toXContNativeBench avgt 2 1787.097 ms/op
SearchHitsProtobufBenchmark.toXContProtoBench  avgt 2  832.177 ms/op
SearchHitsProtobufBenchmark.writeToNativeBench avgt 2  783.957 ms/op
SearchHitsProtobufBenchmark.writeToProtoBench  avgt 2  807.185 ms/op
```
Additional benchmarks comparing protobuf to native protocol with the same inflated DocumentFields size as used here.
Thanks @finnegancarroll, writeToNativeBench & writeToProtoBench are really close. I think the XContent ones we can discard (at the moment), since node-to-node does not use this serialization format (only native serialization).
Thanks a lot @finnegancarroll , it looks like we are at crossroads now: should we move forward with protobuf (since the latency wins are moderate) or (may be) move forward with optimizing the native protocol?
I think we should do both. Switching to an industry standard is a long term bet on improving developer experience and an entire community optimizing the binary protocol for us.
Closing this issue to move forward with client facing gRPC server - #16556.