
Improve ListResultsConsumer duplication check performance

Open dreab8 opened this issue 2 years ago • 5 comments

Just an attempt to address the potential perf issue mentioned in the https://github.com/hibernate/hibernate-orm/pull/4924 discussion.

dreab8 avatar Aug 09 '22 07:08 dreab8

Thanks for your pull request!

This pull request appears to follow the contribution rules.

› This message was automatically generated.

My concern is the impact of the new IdentityHashMap for small/medium results, where the original solution had no real perf issues.

dreab8 avatar Aug 10 '22 13:08 dreab8

My concern is the impact of the new IdentityHashMap for small/medium results, where the original solution had no real perf issues.

We could try to allocate the IdentityHashMap only when exceeding a certain row count threshold, but I'm not sure that makes things much better.

I think, if we pass through a maxResults estimate into org.hibernate.sql.results.spi.ResultsConsumer#consume, we could reduce the possible negative effects to only HQL/Criteria queries that select entities where we have no good estimate. In the end, I think we simply need guidance from the user. The user should be able to provide a hint that says "don't deduplicate", in which case we can avoid the IdentityHashMap.
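The two ideas above (a row-count threshold before allocating the map, and an explicit "don't deduplicate" opt-out) could be combined. A minimal sketch, assuming hypothetical names (`DeduplicatingCollector`, `IDENTITY_MAP_THRESHOLD`) that are not part of Hibernate's actual API: below the threshold a linear identity scan is cheap, and the IdentityHashMap is only built once the result list grows past it.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;

// Hypothetical sketch, not Hibernate's implementation: defer the
// IdentityHashMap until the result list grows past a threshold.
final class DeduplicatingCollector<R> {
    private static final int IDENTITY_MAP_THRESHOLD = 100; // assumed cutoff
    private final List<R> results = new ArrayList<>();
    private IdentityHashMap<R, Boolean> seen; // allocated lazily

    void add(R row) {
        if (seen == null) {
            // small result set: linear scan by reference identity
            for (R existing : results) {
                if (existing == row) {
                    return; // duplicate, skip
                }
            }
            results.add(row);
            if (results.size() > IDENTITY_MAP_THRESHOLD) {
                // crossed the threshold: switch to map-based dedup
                seen = new IdentityHashMap<>(results.size() * 2);
                for (R existing : results) {
                    seen.put(existing, Boolean.TRUE);
                }
            }
        }
        else if (seen.put(row, Boolean.TRUE) == null) {
            results.add(row);
        }
    }

    List<R> getResults() {
        return results;
    }
}
```

With a "don't deduplicate" hint from the user, `add` could skip both branches and append unconditionally, avoiding the map entirely.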

beikov avatar Aug 10 '22 14:08 beikov

IMO conceptually we'd need a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter). Not keen to add a dependency though; could we check how close we get approximating it via an identity hash map?

Sanne avatar Aug 10 '22 15:08 Sanne

IMO conceptually we'd need a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter). Not keen to add a dependency though; could we check how close we get approximating it via an identity hash map?

Yeah, either add a dependency or implement/integrate this ourselves in hibernate-core. It's not trivial to implement, but I think a custom implementation would be worth it. Standard Bloom filters AFAIU need a fixed size, but since we keep the data around, we could grow the bit set on demand, similar to what a hash table does, while using a lot less memory: 1 bit per slot instead of a 64-bit reference is a 64x saving (e.g. for 1M slots, ~128KB of bits instead of ~8MB of references). Growing a bit set also consumes a lot less memory, though rehashing is still necessary.
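The growable bit-set idea could look roughly like the following. This is a hedged sketch with made-up names (`GrowableIdentityFilter` is not a Hibernate class): a single identity-hash function sets bits in a `java.util.BitSet`, and when load gets too high the bit set is rebuilt from the retained results, exactly because the data is kept around anyway. A `false` answer is definitive; a `true` answer may be a false positive, so the caller would fall back to an exact identity check.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of a growable, approximate "possibly seen" filter.
// One hash function only, so it is a degenerate Bloom filter; callers
// must confirm a positive answer with an exact scan.
final class GrowableIdentityFilter {
    private BitSet bits;
    private int capacity; // always a power of two
    private int count;

    GrowableIdentityFilter(int initialCapacity) {
        this.capacity = Integer.highestOneBit(Math.max(16, initialCapacity)) * 2;
        this.bits = new BitSet(capacity);
    }

    private int slot(Object o) {
        int h = System.identityHashCode(o);
        h ^= (h >>> 16); // spread high bits into the low ones
        return h & (capacity - 1);
    }

    /** Returns true if {@code o} MIGHT have been added before. */
    boolean mightContain(Object o) {
        return bits.get(slot(o));
    }

    /** Records {@code o}; past ~50% load, doubles and rehashes the retained elements. */
    void add(Object o, List<?> retained) {
        bits.set(slot(o));
        if (++count * 2 > capacity) {
            capacity *= 2;
            bits = new BitSet(capacity);
            for (Object e : retained) {
                bits.set(slot(e)); // rehash everything we kept around
            }
            bits.set(slot(o)); // o may not be in 'retained' yet
        }
    }
}
```

The memory argument is visible here: the `BitSet` costs `capacity / 8` bytes, versus roughly 8 bytes per reference slot in an `IdentityHashMap`'s backing array, before counting its per-entry overhead.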

beikov avatar Aug 10 '22 18:08 beikov