[issues-912] optimize serialization size for primitive arrays - IntOnly for now
PR for #912 (currently limited to int primitive arrays only, to showcase the feature, based on the guidelines).
Motivation: Primitive arrays are often used for efficient code (in both size and execution time), but the required size of the array cannot always be decided accurately beforehand. Serializing such arrays can produce a large serialized object that consists mostly of zero (NUL) bytes. Since Kryo is focused on efficiency, it should provide a configuration option to optimize this further and store only the necessary values whenever there is an opportunity to do so.
Please refer to the description on #912 for more info on the approach.
Thanks
@jhsenjaliya: Thanks for this PR. Your approach is interesting, but the changes required to support it for all types are quite invasive. I will keep this PR open for now to see if anyone else is interested in this optimization.
Sure, I will let you think this through. My observation has been that optimizations are always tricky, but getting the smallest possible storage size is not only worth the effort, it also adds a lot of value to Kryo. Thanks for the review!
I agree the feature can be useful. I've used skipping zeros at the beginning or end in my projects, where it makes sense.
I don't think we want a setting that changes the behavior of all Input/Output. The feature can be entirely self-contained within a serializer. Where you want this, which is unlikely to be everywhere, you would use the serializer.
It could make sense for Kryo to provide such a serializer, though there are many use case specific serializers that could be provided. We don't try to provide them all, especially when the implementation is relatively trivial.
If you really wanted to do it everywhere, you could extend Input/Output, but I don't think it makes sense for Kryo to provide that as it's too application specific.
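For illustration, here is a minimal sketch of the self-contained approach: drop trailing zeros when writing an int[] and restore them when reading. It uses plain `java.nio.ByteBuffer` rather than Kryo's Input/Output, and the class name `TrimmedIntArrayCodec` and the byte layout (original length, kept count, then the kept values) are assumptions made for this example, not Kryo API. In Kryo the same logic would presumably sit inside a custom serializer registered for `int[]`.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not part of Kryo: trims trailing zeros from an int[]. */
public final class TrimmedIntArrayCodec {
    private TrimmedIntArrayCodec() {}

    /** Writes the original length, the number of kept values, then the kept values. */
    public static byte[] encode(int[] values) {
        // Find the end of the last non-zero element.
        int kept = values.length;
        while (kept > 0 && values[kept - 1] == 0) {
            kept--;
        }
        // Two 4-byte header ints plus 4 bytes per kept value.
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 * kept);
        buf.putInt(values.length);
        buf.putInt(kept);
        for (int i = 0; i < kept; i++) {
            buf.putInt(values[i]);
        }
        return buf.array();
    }

    /** Rebuilds the full-length array; the trimmed tail stays at Java's default 0. */
    public static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int length = buf.getInt();
        int kept = buf.getInt();
        int[] values = new int[length];
        for (int i = 0; i < kept; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }
}
```

With this layout, a 1000-element array whose only non-zero values are in the first two slots takes 16 bytes instead of 4000. The sketch does not handle null arrays, and the fixed-width header could be replaced by variable-length ints to save a few more bytes.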
@NathanSweet, thanks for providing that input. I believe optimizations like this are better suited to settings/configs than to an entirely new serializer/deserializer. Also, this config is OFF by default, so there is no change to the existing behavior; it can be turned ON only when the user wants additional improvements like this.
I also like your idea of doing this for consecutive default values (zeros) at the beginning instead of just at the end. Maybe there can be settings/configs for all 3 cases: optimize_continuous_zeros_in_starting_only, optimize_continuous_zeros_in_end_only and optimize_continuous_zeros (for both).
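As a rough sketch of how the three cases could differ, the example below picks the range of values to keep based on a mode and records the offset so that the array can be rebuilt. The `ZeroTrimSketch` class, its `Mode` enum, and the byte layout (original length, start offset, kept count, then values) are hypothetical names and choices for this illustration only; it uses `java.nio.ByteBuffer` and is not tied to Kryo's Input/Output.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of the three trimming modes; not Kryo API. */
public final class ZeroTrimSketch {
    public enum Mode { LEADING, TRAILING, BOTH }

    private ZeroTrimSketch() {}

    public static byte[] encode(int[] values, Mode mode) {
        int start = 0;
        int end = values.length; // exclusive
        // Skip zeros at the end unless only leading zeros should be trimmed.
        if (mode != Mode.LEADING) {
            while (end > 0 && values[end - 1] == 0) end--;
        }
        // Skip zeros at the start unless only trailing zeros should be trimmed.
        if (mode != Mode.TRAILING) {
            while (start < end && values[start] == 0) start++;
        }
        int kept = end - start;
        // Three 4-byte header ints plus 4 bytes per kept value.
        ByteBuffer buf = ByteBuffer.allocate(12 + 4 * kept);
        buf.putInt(values.length);
        buf.putInt(start);
        buf.putInt(kept);
        for (int i = start; i < end; i++) {
            buf.putInt(values[i]);
        }
        return buf.array();
    }

    /** The header is the same for every mode, so decoding does not need the mode. */
    public static int[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int length = buf.getInt();
        int start = buf.getInt();
        int kept = buf.getInt();
        int[] values = new int[length];
        for (int i = 0; i < kept; i++) {
            values[start + i] = buf.getInt();
        }
        return values;
    }
}
```

For a 10-element array with non-zero values only at indexes 4 and 5, this layout gives 36 bytes in either single-sided mode and 20 bytes when both ends are trimmed, against 40 bytes for the raw values. The 12-byte header means very small or dense arrays can end up larger than the untrimmed form.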
I can imagine a lot of storage savings with this. I hope more people find this feature/optimization useful when needed.