Consider using Kryo instance pooling
Logging this in the event others are running into performance issues with Kryo serialization. I'd moved away from using the convenient KryoSerializer class due to high instantiation costs.
https://github.com/locationtech/geotrellis/blob/f8af6e6d72749b8fb175a0c44cbc7d027a940686/spark/src/main/scala/geotrellis/spark/util/KryoSerializer.scala#L31-L55
I came across this when investigating things a bit more, and wondered if the Pool class might be a better way than an @transient lazy val:
https://github.com/EsotericSoftware/kryo#pooling
NB: This is not a high priority for me right now, but it's likely I'll experiment with it some.
@metasim Hm, Kryo is usually faster than Java serialization; also look at how GeoMesa uses it, where it works really efficiently for geometries. Are you saying that Java serialization is faster for you, or what approach do you prefer? I'm curious what generic approach you use, or would like to use, to make Spark work with Kryo pooling.
Or do you just mean that we sometimes wrap our values manually in KryoWrapper to make them serializable, and that this can be inefficient?
@pomadchin Kryo is faster at the serialization part, once the requisite codec classes are constructed. IOW, the newInstance call here is very expensive in comparison to the actual serialization:
https://github.com/locationtech/geotrellis/blob/f8af6e6d72749b8fb175a0c44cbc7d027a940686/spark/src/main/scala/geotrellis/spark/util/KryoSerializer.scala#L46
Furthermore, Kryo is not thread safe, so you have to be very careful when trying to keep pre-constructed instances around.
This is hinted at in this issue, which is where I found the tip on the Pool class:
https://github.com/EsotericSoftware/kryo/issues/188
This comment from that issue also hints at some of the deeper complexities we have to consider:
Well, AFAIK, big data guys are sometimes using Kryo with Hadoop, Storm, etc. In many cases ThreadLocal is not always usable, because thread pools are often dynamically created, configured and so on. Also, class loaders are sometimes different per thread/thread-pool/task. So, you may need to construct Kryo instances rather dynamically and rather often or you have eventually a deeper re-factoring of your code/system, e.g. to place Kryo in the common parent class loader, etc.
As far as I understand, the Pool class addresses these.
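To make the pooling idea concrete, here is a minimal sketch of the general pattern using only the Java standard library. It is not Kryo's own Pool class and not GeoTrellis code; `SimplePool` and `use` are hypothetical names chosen for illustration. The assumption is that the pooled object is expensive to build and not thread safe, so each instance is lent to one caller at a time and reused afterwards.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical minimal object pool (illustration only, not Kryo's Pool class). */
final class SimplePool<T> {
    // Idle instances waiting to be reused; the queue itself is thread safe.
    private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<>();
    // Builds a new instance when no idle one is available.
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Borrow an instance, run the action with it, and always hand it back. */
    <R> R use(Function<T, R> action) {
        T instance = free.poll();      // null when the pool has no idle instance
        if (instance == null) {
            instance = factory.get();  // pay the construction cost only on a miss
        }
        try {
            return action.apply(instance);
        } finally {
            free.offer(instance);      // make it available to the next caller
        }
    }
}
```

In the Kryo case the supplier would wrap whatever construction is currently expensive (the newInstance call linked above), and each serialization call would go through `use`. Because an instance is only ever held by one caller between `poll` and `offer`, this sidesteps the thread-safety problem without tying instances to specific threads the way a ThreadLocal would.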
BTW, I wasn't saying JavaSer was faster (I'm not even looking at that). Currently I'm hand-rolling serialization to Catalyst struct types. Previously I was serializing instances with Kryo and then storing them as a Catalyst BinaryType. But when I profile execution, the code is spending most of its time instantiating the Kryo serializers.
@metasim thanks for the clarification, that's cool :+1:
Turns out that the Pool class listed in the Kryo docs doesn't exist in the version of Kryo that Spark uses, and an initial test using an updated version with Spark caused a MethodNotFoundException. :-(