[Feature Request][gRPC] Move primitive array types out of source
Is your feature request related to a problem? Please describe
Adding gRPC support has been great for performance and interoperability on OpenSearch.
That said, from @karenyrx's presentation at OpenSearchCon NA, it looks like a lot of the performance gain comes from combining gRPC/Protobuf with SMILE encoding for document source, whether as part of an IndexRequest within a _bulk call or as part of the SearchHits returned from a search request. That works, but SMILE is kind of a weird format that's not widely used.
The performance benefit of SMILE largely comes from the fact that the JSON representation of numeric arrays is absolutely terrible. Using 1 byte per base-10 digit, plus a comma between values, generally makes JSON at least 2x larger than a typical binary representation.
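To make the 2x claim concrete, here's a small self-contained sketch (not OpenSearch code) that counts the bytes of a JSON array of 7-digit ints versus a fixed-width 4-bytes-per-int binary layout:

```java
public class EncodingSize {
    // Size of the JSON text representation: one byte per digit, plus commas and brackets
    static int jsonBytes(int[] values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(values[i]);
        }
        return sb.append(']').length();
    }

    // Fixed-width binary representation: 4 bytes per int
    static int binaryBytes(int[] values) {
        return values.length * Integer.BYTES;
    }

    public static void main(String[] args) {
        int[] values = new int[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = 1_000_000 + i; // 7-digit values
        }
        System.out.println("JSON bytes:   " + jsonBytes(values));   // 8001
        System.out.println("Binary bytes: " + binaryBytes(values)); // 4000
    }
}
```

For 1000 seven-digit ints, JSON needs roughly double the bytes of the binary layout, and the gap grows further for floats/doubles, where the decimal text expansion is even longer.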
Describe the solution you'd like
I propose that we add an optional Map<String, Object> fieldValues field to the IndexRequest class to accompany the document source on the way into OpenSearch. These values would generally be primitive arrays and/or other binary data types. (You could pass any field, I suppose, but you wouldn't see much benefit for text/keyword/single-valued numeric fields where the JSON representation isn't that bad.) The idea is that these fields are not encoded as any kind of XContent. Note that this also means that regular REST APIs won't see any benefit, since they send their whole request as XContent.
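A minimal sketch of what the proposed API shape could look like. This is a hypothetical stand-in class, not the real org.opensearch.action.index.IndexRequest, and the fieldValue(...) setter is the proposed addition, not an existing method:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for the proposed IndexRequest change; field names
// and method signatures are assumptions, not the current OpenSearch API.
public class IndexRequestSketch {
    private String source;                               // XContent document source (JSON here)
    private final Map<String, Object> fieldValues = new LinkedHashMap<>();

    public IndexRequestSketch source(String json) {
        this.source = json;
        return this;
    }

    // Proposed setter: these values would bypass XContent entirely and
    // travel as raw Protobuf alongside the document source.
    public IndexRequestSketch fieldValue(String field, Object value) {
        fieldValues.put(field, value);
        return this;
    }

    public Map<String, Object> fieldValues() {
        return fieldValues;
    }

    public static void main(String[] args) {
        IndexRequestSketch request = new IndexRequestSketch()
            .source("{\"title\":\"hello\"}")                    // cheap fields stay in JSON source
            .fieldValue("embedding", new float[] {0.1f, 0.2f}); // large numeric array skips XContent
        System.out.println(request.fieldValues().keySet());
    }
}
```

The point of the sketch is the split: small text/keyword fields stay in the ordinary source, while large primitive arrays ride in fieldValues and never get JSON-encoded.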
On the way in, the fieldValues field should get copied from the IndexRequest into the SourceToParse, which gets passed to the DocumentMapper#parse method. Somewhere in there, we should copy these fields into the ParsedDocument.
This fieldValues field would then be part of the Protobuf representation of IndexRequest, which means we can pass the values as raw Protobuf, without any kind of additional XContent encoding.
Since these fields are not recorded as part of the document source, you'd need to use a mechanism like derived source to read them out if you want them in your SearchHits. I don't think we need to modify SearchHit, since we already have the documentFields object, which returns fields outside of source and can use any arbitrary Object as values.
Related component
Clients
Describe alternatives you've considered
Obviously, we pretty much already get this benefit from using SMILE (or CBOR) as an alternative encoding, since they don't suffer from the same inefficiency as JSON for numeric arrays. I'm just thinking that JSON + Protobuf is easier for most developers to work with compared to these less common formats. (Also, while I don't have the numbers to back it up, I'm guessing that Protobuf -> float[] decoding is probably cheaper than Protobuf -> byte[] -> SMILE -> float[] decoding.)
Additional context
Here is @karenyrx and @amberzsy's talk on gRPC + Protobuf, where they talk about the performance benefits: https://youtu.be/E4ZmtDA9G8c
Thanks @msfroh for the proposal! This is a good idea.
Just to add my 2 cents:
- From the protobuf schema perspective, if I understand this correctly, we are proposing moving the primitive document fields out of the current `bytes` protobuf field into a `map<string, ObjectMap>` field. I theorize that instead of using a `map<string, ObjectMap>`, we may get even better serialization gains if we were to use raw `repeated float`, `repeated int`, etc. fields, and have separate `repeated string fieldNamesForFloatArray`, `repeated string fieldNamesForIntArray`, etc. fields to store the field names corresponding to the primitive arrays. This way the `repeated <primitive>` fields can fully capitalize on the protobuf packed encoding feature, which I don't think we'd get with `ObjectMap`, as it uses `repeated Value` instead of a `repeated <primitive>`.
- I am also curious whether the overhead of splitting the document into "primitive" vs "non-primitive" fields will outweigh the latency gains from the improved serialization/deserialization - but this could be tested out.
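One hypothetical sketch of that parallel-arrays idea (field names and numbers are illustrative only; the length fields are my addition, not part of the suggestion above, so that multiple arrays can share one packed field):

```proto
// Hypothetical sketch, not a proposed schema. All arrays of one primitive
// type are concatenated into a single packed repeated field, with parallel
// name/length fields used to slice them back apart on the receiving side.
message PrimitiveFields {
  repeated string fieldNamesForFloatArray = 1; // one entry per float-array field
  repeated int32 floatArrayLengths = 2;        // how many values each field owns
  repeated float floatValues = 3;              // all float arrays, concatenated
  repeated string fieldNamesForIntArray = 4;
  repeated int32 intArrayLengths = 5;
  repeated int32 intValues = 6;                // packed by default in proto3
}
```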
Overall, this proposal makes sense to me.
For Protobuf, I was thinking that the map could be something like `map<string, FieldValue>`, where `FieldValue` would be:

```proto
message FieldValue {
  oneof data_type {
    IntArray int_array = 1;
    FloatArray float_array = 2;
    LongArray long_array = 3;
    // ... other types
  }
}

message IntArray {
  repeated int32 values = 1;
}

message FloatArray {
  repeated float values = 1;
}

message LongArray {
  repeated int64 values = 1;
}
```
I think that gives us packed encoding, but I'm not very experienced with Protobuf.
Hi @msfroh I really like the idea. I'd like to work on this, can you please assign the issue to me?