examine optimizations for projection
Based on new benchmarks, projection is unnecessarily expensive. Projection on 1000 tuples take 50 seconds, which is really excessive. Profiling reveals that duplicate tuple detection (tuple hashing) is the cause.
Some potential optimizations are:
- elimination of deduplication on projection on key attributes - if the attributes passed in cover a candidate key, then no deduplication (hashing) is required
- replacing relFold + union with something targeting tuples instead of relations
- deferring deduplication until the projection is complete
- bloom filters for the hash values as a first pass
Are there some other ideas?
Union, which is used by projection, is probably also unnecessarily expensive.
A basic optimization for projection has been added in 04fd33c4efd28a0c7c8ef8d95bd163adc9d8ab5b.
union is still expensive due to RelationTuple's Eq instance relying on attributesEqual called by the hashing mechanism. In most cases, that's extraneous validation since the types have already been validated. It could make sense to include multiple forms of equality.