project-m36 icon indicating copy to clipboard operation
project-m36 copied to clipboard

examine optimizations for projection

Open agentm opened this issue 7 years ago • 3 comments

Based on new benchmarks, projection is unnecessarily expensive. Projection on 1000 tuples take 50 seconds, which is really excessive. Profiling reveals that duplicate tuple detection (tuple hashing) is the cause.

Some potential optimizations are:

  • elimination of deduplication on projection on key attributes - if the attributes passed in cover a candidate key, then no deduplication (hashing) is required
  • replacing relFold + union with something targeting tuples instead of relations
  • deferring deduplication until the projection is complete
  • bloom filters for the hash values as a first pass

Are there some other ideas?

agentm avatar Dec 25 '17 19:12 agentm

Union, which is used by projection, is probably also unnecessarily expensive.

agentm avatar Dec 26 '17 18:12 agentm

A basic optimization for projection has been added in 04fd33c4efd28a0c7c8ef8d95bd163adc9d8ab5b.

agentm avatar Jan 06 '18 16:01 agentm

union is still expensive due to RelationTuple's Eq instance relying on attributesEqual called by the hashing mechanism. In most cases, that's extraneous validation since the types have already been validated. It could make sense to include multiple forms of equality.

agentm avatar Jan 06 '18 17:01 agentm