project-m36 examine optimizations for projection

Based on new benchmarks, projection is unnecessarily expensive. Projection on 1000 tuples take 50 seconds, which is really excessive. Profiling reveals that duplicate tuple detection (tuple hashing) is the cause.

Some potential optimizations are:

elimination of deduplication on projection on key attributes - if the attributes passed in cover a candidate key, then no deduplication (hashing) is required
replacing relFold + union with something targeting tuples instead of relations
deferring deduplication until the projection is complete
bloom filters for the hash values as a first pass

Are there some other ideas?

Dec 25 '17 19:12 agentm

Union, which is used by projection, is probably also unnecessarily expensive.

Dec 26 '17 18:12 agentm

A basic optimization for projection has been added in 04fd33c4efd28a0c7c8ef8d95bd163adc9d8ab5b.

Jan 06 '18 16:01 agentm

union is still expensive due to RelationTuple's Eq instance relying on attributesEqual called by the hashing mechanism. In most cases, that's extraneous validation since the types have already been validated. It could make sense to include multiple forms of equality.

Jan 06 '18 17:01 agentm