hibernate-orm
hibernate-orm copied to clipboard
Improve ListresultConsumer duplication check method performance
just an attempt to address the potential perf issue mentioned in the https://github.com/hibernate/hibernate-orm/pull/4924 discussion.
Thanks for your pull request!
This pull request appears to follow the contribution rules.
› This message was automatically generated.
my concern is the impact of the new IdentityHashMap
for small/medium results where the original solution had not real perf issues
my concern is the impact of the new
IdentityHashMap
for small/medium results where the original solution had not real perf issues
We could try to allocate the IdentityHashMap
only when exceeding a certain row count threshold, but not sure if that makes things a lot better.
I think, if we pass through a maxResults estimate into org.hibernate.sql.results.spi.ResultsConsumer#consume
, we could reduce the possible negative effects to only HQL/Criteria queries that select entities where we have no good estimate. In the end, I think we simply need guidance from the user. The user should be able to provide a hint that says "don't deduplicate", in which case we can avoid the IdentityHashMap
.
IMO conceptually we'd need a https://en.wikipedia.org/wiki/Bloom_filter Not keen to add a dependency though - could we check how close we get approximating it via an identity hashmap ?
IMO conceptually we'd need a https://en.wikipedia.org/wiki/Bloom_filter Not keen to add a dependency though - could we check how close we get approximating it via an identity hashmap ?
Yeah, either add a dependency or implement/integrate this ourselves in hibernate-core. It's not trivial to implement this, but I think it would be worth it to add a custom implementation. Standard Bloom filters AFAIU need a fixed size, but since we keep the data around, we could grow the bit set on-demand similar to what a hash table does, but use a lot less memory (64 times memory savings i.e. for 1M elements we'd use ~2KB instead of ~1MB). Growing a bit set also consumes a lot less memory, though rehashing is still necessary.