ozone
ozone copied to clipboard
HDDS-10811. Reduce UTF8 string encoding by caching encoding result
What changes were proposed in this pull request?
Reduce com/google/protobuf/ByteString.copyFromUtf8 when encode a protobuf Object.
- For a String type protobuf field, if you use the
setXXXmethod (such asbuilder.setHostName), the protobuf will callByteString.copyFromUtf8convert UTF8 String toByteString. - For some UTF8 String with fixed content, we can cache its
ByteStringvalue, and usesetXXXBytesto avoid callingByteString.copyFromUtf8
Before https://issues.apache.org/jira/secure/attachment/13068731/om-create-key-alloc-profiler-setxxxString.html
After
The com/google/protobuf/ByteString.copyFromUtf8 in the DatanodeDetail disappear.
https://issues.apache.org/jira/secure/attachment/13068732/om-create-key-alloc-profiler-setxxbytes.html
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10811
How was this patch tested?
Existing Test.
@xichen01 , thanks a lot for working on this! How about add a
StringWithByteStringclass as below? Than, it won't have any additional cache lookups.diff --git a/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java b/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java index 26f6be44e5..d62af61d2a 100644
@szetszwo Thanks for the suggestion, but I noticed that in OM, there are quite a few instances of DatanodeDetail, and this implementation adds more memory consumption to the member variable than before.
The current my PR idea is not to add extra memory consumption, so it uses a global static Map to cache the ByteString.
@xichen01 , do you have comparison before and after this change? What is the memory saving? Is it in MBs or GBs?
If there are many DatanodeDetails instances , we probably should cache DatanodeDetails instances instead of the ByteString inside the DatanodeDetails.
@szetszwo Thanks for your reply
If there are many DatanodeDetails instances , we probably should cache DatanodeDetails instances instead of the ByteString inside the DatanodeDetails.
-
But
DatanodeDetailsis not an immutable object, so it is not possible to simply cacheDatanodeDetails, which would require considerable modification, and possibly have some compatibility issues. -
Moreover,
DatanodeDetailsnot only consumes memory space for its own objects, but also consumes a lot of resources during the encode/decode ofDatanodeDetails, this PR is to reduce the consumption ofDatanodeDetailsencode.
@xichen01 , do you have comparison before and after this change? What is the memory saving? Is it in MBs or GBs?
Here's an alloc flame graph that shows the effect of the optimization
Before
https://issues.apache.org/jira/secure/attachment/13068731/om-create-key-alloc-profiler-setxxxString.html
After
The com/google/protobuf/ByteString.copyFromUtf8 in the DatanodeDetail disappear.
https://issues.apache.org/jira/secure/attachment/13068732/om-create-key-alloc-profiler-setxxbytes.html
Of course, this PR is only a small improvement, and it would be a big improvement to be able to cache the DatanodeDetails, which may require more work.
...
DatanodeDetailsis not an immutable object, ...
SCM and other code probably don't needs different DatanodeDetails instances for the same datanode. I agree that it is not easy to fix. So, our change, although not fixing it, should not make it harder to fix it in the future.
... but also consumes a lot of resources during the encode/decode of DatanodeDetails, this PR is to reduce the consumption of DatanodeDetails encode.
This PR also adds Guava Cache with some hard coded size (50,000). Guava Cache is a complicated data structure. It may run threads internally. It needs cache lookup. It will create more problems. It does not worth to save the memory space for 50,000 short ByteString.
That's why I suggest to add StringWithByteString. Please consider it.
This PR also adds Guava Cache with some hard coded size (50,000). ....
In fact, the reason I use Guava Cache is to prevent accidents from occurring that would result in too many keys in the cache. Guava Cache makes it easy to control how much to cache.
If we can guarantee the number of keys to be cached ( hostname, ip, uuid, ... (should not be too much.)), I think we can use a simpler Map Cache structure such as ConcurrentHashMap
That's why I suggest to add StringWithByteString. Please consider it.
StringWithByteString still needs to repeatedly calculate ByteString.copyFromUtf8, which moves the call of ByteString.copyFromUtf8 from inside protobuf to StringWithByteString (Some DatanodeDetails have short lifecycles, may only be accessed once)
But for DatanodeDetail in Pipeline Cache, it will be effective solution
How do you think?
StringWithByteString still needs to repeatedly calculate ByteString.copyFromUtf8, which moves the call of ByteString.copyFromUtf8 from inside protobuf to StringWithByteString (Some DatanodeDetails have short lifecycles, may only be accessed once)
Only the DatanodeDetails created from the Builder needs converting. The DatanodeDetails created from another DatanodeDetails does not (updated the code in my previous comment). The short lived DatanodeDetails probably are not created by the Builder. No?
BTW, we could also change Builder to pass StringWithByteString when converting proto-to-DatanodeDetails. Then, it will already have the ByteString.
@szetszwo Thanks for your new comments and patch, I have applied the patch.