ScalaPB
[proposal] Provide a way to inject String deduplication into the deserialization pipeline
This is a follow-up to #1373.
Take a look at @shipilev's great talk on the anatomy of Strings and their effect on JVM performance: https://youtu.be/YgGAUGC9ksk?t=1739
In a nutshell:

So it can be beneficial to release allocated objects eagerly, or to avoid allocating them at all, where we can. This shrinks the set of live objects the collector has to process and, in some scenarios, removes the extra allocation pressure altogether. Moreover, strings are not the only beneficiaries: a suitable deduplicator can eliminate allocations of any large, frequently occurring object.
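To make the effect concrete, here is a minimal sketch using only the JDK and the Scala standard library. `StringDedupDemo`, `decode`, and `decodeDeduplicated` are hypothetical names, not ScalaPB APIs. It shows that decoding the same bytes twice yields two equal but distinct `String` objects, while routing the result through a canonicalizing map returns one shared instance, so the duplicate can be collected right away.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable

object StringDedupDemo {
  // Hypothetical canonicalizing cache: maps a decoded value to the first
  // instance seen. Unbounded and not thread-safe, for illustration only.
  private val canonical = mutable.HashMap.empty[String, String]

  // Plain decoding: every call allocates a fresh String.
  def decode(bytes: Array[Byte]): String = new String(bytes, UTF_8)

  // Decoding with deduplication: the fresh String is dropped whenever an
  // equal instance is already cached, so only one copy stays reachable.
  def decodeDeduplicated(bytes: Array[Byte]): String = {
    val s = decode(bytes)
    canonical.getOrElseUpdate(s, s)
  }
}
```

This variant still allocates a temporary `String` on every call; avoiding that allocation as well requires checking the cache on the raw bytes before decoding.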
Consider a deduplicator like this:
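import com.google.protobuf.ByteString

// ByteStringCodec[T] is not defined in this snippet; it is assumed to expose fromByteString(bs): T.
trait Deduplicator[T] {
  def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T
}

object Deduplicator {
  // Pass-through: decodes on every call and caches nothing.
  def noop[T]: Deduplicator[T] = new Deduplicator[T] {
    def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T =
      codec.fromByteString(bs)
  }
}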
The default deduplicator for any "complex" field is Deduplicator.noop, which simply delegates the call to the codec.
More complex deduplicators may be provided by users in message, package or generator options.
Thanks for posting. Can you add the following to the proposal:
- How will the deduplicator be selected?
- Would having a deduper require the field to be lazily evaluated (as in #1373)?
- Can deduping be made optional in the generator?
- Should it impact instantiation outside binary deserialization: fromPMessage (JSON, Spark)?
How will the deduplicator be selected?
One option is to provide the name of a class that implements the DeduplicatorProvider interface:
trait DeduplicatorProvider {
  def getOrCreateDeduplicator[T: ByteStringCodec]: Deduplicator[T]
}
This name may be provided at the generator, package, file, or even field level.
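As an illustration, a provider could hand out one deduplicator per codec instance. The sketch below assumes protobuf-java is on the classpath (for `com.google.protobuf.ByteString`) and restates the proposed traits so that it compiles on its own. `ByteStringCodec` is the codec type assumed by this proposal, and `PerCodecProvider` is a hypothetical name; neither exists in ScalaPB today.

```scala
import scala.collection.concurrent.TrieMap
import com.google.protobuf.ByteString

// Types as proposed in this thread (assumptions, not existing ScalaPB APIs).
trait ByteStringCodec[T] { def fromByteString(bs: ByteString): T }

trait Deduplicator[T] {
  def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T
}

trait DeduplicatorProvider {
  def getOrCreateDeduplicator[T: ByteStringCodec]: Deduplicator[T]
}

// Hypothetical provider: creates a deduplicator on the first request for a
// codec and returns the same instance on later requests.
class PerCodecProvider extends DeduplicatorProvider {
  private val byCodec = TrieMap.empty[ByteStringCodec[_], Deduplicator[_]]

  def getOrCreateDeduplicator[T: ByteStringCodec]: Deduplicator[T] = {
    val codec = implicitly[ByteStringCodec[T]]
    // The cast is safe because each entry is created for its own codec's T.
    byCodec.getOrElseUpdate(codec, newDeduplicator[T]).asInstanceOf[Deduplicator[T]]
  }

  // Placeholder: a pass-through deduplicator. A real provider would return
  // a caching implementation here.
  private def newDeduplicator[T]: Deduplicator[T] = new Deduplicator[T] {
    def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T =
      codec.fromByteString(bs)
  }
}
```

Keying by codec instance is only one possible choice; a provider could just as well key by message type or by field, depending on where the option is declared.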
Would having a deduper require the field to be lazily evaluated (as in https://github.com/scalapb/ScalaPB/issues/1373)?
They are related, but not tightly coupled. You may want to reduce the memory footprint of common objects even when they are parsed eagerly.
So, when you deserialize a string or message field, you can pass its already delimited ByteString to the Deduplicator, which checks whether the value is in the cache by ByteString hashing and comparison.
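A minimal sketch of that lookup, assuming protobuf-java is on the classpath and restating the proposed traits so the block compiles on its own. `CachingDeduplicator` is a hypothetical name. It relies on `ByteString` having content-based `equals` and `hashCode`, so equal payloads hit the same cache entry and the codec runs only on a miss.

```scala
import scala.collection.concurrent.TrieMap
import com.google.protobuf.ByteString

// Types as proposed in this thread (assumptions, not existing ScalaPB APIs).
trait ByteStringCodec[T] { def fromByteString(bs: ByteString): T }

trait Deduplicator[T] {
  def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T
}

// Hypothetical caching deduplicator keyed by the raw field bytes.
// The cache is unbounded here; a production version would need an eviction
// policy, since both the keys and the decoded values stay reachable.
class CachingDeduplicator[T] extends Deduplicator[T] {
  private val cache = TrieMap.empty[ByteString, T]

  def deduplicate(bs: ByteString)(implicit codec: ByteStringCodec[T]): T =
    // On a hit the codec is not invoked at all; on a miss the value is
    // decoded once and published for later lookups.
    cache.getOrElseUpdate(bs, codec.fromByteString(bs))
}
```

With a codec that decodes UTF-8 strings, two calls with equal payloads return the same `String` instance, and the second call skips decoding.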
Can deduping be made optional in the generator?
Sure, it can be made completely optional at any configuration level. I have also found it useful to disable it for individual fields. For example, we have a string description field that contributes nothing to the business logic but would pollute the string cache if we tried to deduplicate it. Ideally, I would keep this field lazy and non-cacheable.
This can be achieved by checking for the presence of a deduplicate field or message option and disabling it for fields that shouldn't be cached.
Should it impact instantiation outside binary deserialization: fromPMessage (JSON, Spark)?
I think so.
Closing due to inactivity. Feel free to comment if this is still needed.