Project Hummingbird
Trino has had a columnar/vectorized evaluation engine since its inception in 2012. After the initial implementation and optimization, and once we were satisfied with the performance for the majority of use cases, we focused our efforts on other areas. Although we've made incremental performance improvements in the past few years, there is still room for further optimization.
We're starting Project Hummingbird with the goal of bringing Trino's columnar/vectorized evaluation engine to the next level. This includes improvements in areas such as filter, projection, aggregation, and join evaluation, as well as any other areas we identify along the way. So far, we have identified the following opportunities:
- Megamorphism and virtual dispatch in core loops due to call sites seeing a multitude of block types (see the dispatch sketch after this list)
- Suboptimal code generation for complex expressions and required null checks
- Adaptive scalar operator fusion in the filter/projection evaluation loop
- Adaptive expression evaluation that leverages the runtime cost and selectivity of expressions (see the adaptive filter sketch after this list)
- Inefficiencies in creating and filling blocks (BlockBuilder abstraction)
- Opportunities for block-specific evaluation and short-circuiting
- RLE/dictionary optimizations when evaluating aggregations and subexpressions in filters and projections (see the dictionary sketch after this list)
- Runtime per-block traits for specialized, data-dependent processing logic (see the traits sketch after this list):
  - null presence and null propagation
  - eliminating overflow checks for operations on small numbers
  - eliminating NaN handling for operations on data without NaNs
  - eliminating size checks when variable-length data is known to be small
  - optimized algorithms for ASCII-only data
- Introduce abstractions and batch calling conventions that make it easier to implement functions and operators leveraging SIMD instructions via Java's new Vector API, and, in the future, possibly GPUs via OpenCL or CUDA (see the Vector API sketch after this list)
- Improve management of intermediate data buffers across operator boundaries
- Specialized hash table implementations for small cardinalities (see the group-by sketch after this list)
- Optimized storage of GROUP BY intermediates for fixed-size types to improve memory locality and avoid multiple indirections
- Take advantage of new JVM features such as VarHandles and the MemorySegment and MemoryLayout APIs (see the MemorySegment sketch after this list)
- Push down selection masks into connectors to improve I/O and decoding performance in certain cases
- Parquet reader optimizations to bring it on par with the ORC reader
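
To illustrate the megamorphism item, here is a minimal, self-contained sketch; the `Block` interface and implementations below are simplified stand-ins, not Trino's actual SPI. When a hot loop calls through an interface whose call site has seen many concrete block types, the JIT stops inlining; dispatching once per block and running a type-specialized loop keeps the inner call sites monomorphic.

```java
// Simplified stand-in for a columnar block abstraction (not Trino's SPI)
interface Block
{
    long getLong(int position);
    int getPositionCount();
}

record LongArrayBlock(long[] values) implements Block
{
    @Override
    public long getLong(int position) { return values[position]; }
    @Override
    public int getPositionCount() { return values.length; }
}

record RunLengthBlock(long value, int positionCount) implements Block
{
    @Override
    public long getLong(int position) { return value; }
    @Override
    public int getPositionCount() { return positionCount; }
}

final class SumExample
{
    // If this loop sees many Block implementations at runtime, the getLong
    // call site becomes megamorphic and the JIT gives up on inlining it
    static long sumGeneric(Block block)
    {
        long sum = 0;
        for (int i = 0; i < block.getPositionCount(); i++) {
            sum += block.getLong(i);
        }
        return sum;
    }

    // Dispatching once per block and running a type-specialized loop keeps
    // the hot path monomorphic, so the JIT can inline and vectorize it
    static long sum(Block block)
    {
        if (block instanceof LongArrayBlock b) {
            long sum = 0;
            for (long value : b.values()) {
                sum += value;
            }
            return sum;
        }
        if (block instanceof RunLengthBlock b) {
            return b.value() * b.positionCount();
        }
        return sumGeneric(block);
    }
}
```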
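For the adaptive expression evaluation item, one possible shape, purely illustrative (`AdaptiveFilter` and `Conjunct` are hypothetical names, not Trino code), is to track the observed pass rate of each conjunct and periodically reorder so the most selective ones run first. A complete version would also weigh per-row evaluation cost, not just selectivity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

final class AdaptiveFilter
{
    static final class Conjunct
    {
        final IntPredicate predicate;
        long evaluated;
        long passed;

        Conjunct(IntPredicate predicate) { this.predicate = predicate; }

        double observedSelectivity()
        {
            // Optimistic prior until we have measurements
            return evaluated == 0 ? 1.0 : (double) passed / evaluated;
        }
    }

    private final List<Conjunct> conjuncts;

    AdaptiveFilter(List<Conjunct> conjuncts) { this.conjuncts = conjuncts; }

    // Filter a batch of positions in place, recording each conjunct's pass rate
    int filter(int[] positions, int count)
    {
        for (Conjunct conjunct : conjuncts) {
            int output = 0;
            for (int i = 0; i < count; i++) {
                int position = positions[i];
                if (conjunct.predicate.test(position)) {
                    positions[output++] = position;
                }
            }
            conjunct.evaluated += count;
            conjunct.passed += output;
            count = output;
        }
        return count;
    }

    // Periodically re-sort so the most selective conjuncts (lowest pass rate)
    // run first and discard rows as early as possible
    void adapt()
    {
        conjuncts.sort(Comparator.comparingDouble(Conjunct::observedSelectivity));
    }
}
```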
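For the RLE/dictionary item, the core idea can be sketched as follows (the dictionary representation is simplified relative to Trino's `DictionaryBlock`): evaluate a projection once per distinct dictionary entry, after which the per-row work collapses to an index lookup.

```java
import java.util.function.LongUnaryOperator;

final class DictionaryProjection
{
    // values: the distinct dictionary entries; ids[i]: which entry row i uses
    static long[] projectDictionary(long[] values, int[] ids, LongUnaryOperator projection)
    {
        // Evaluate the (possibly expensive) projection once per dictionary entry...
        long[] projected = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            projected[i] = projection.applyAsLong(values[i]);
        }
        // ...so the per-row work is just an index lookup. With a block
        // abstraction the result can even stay dictionary-encoded, and this
        // remapping disappears entirely.
        long[] result = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = projected[ids[i]];
        }
        return result;
    }
}
```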
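For the per-block traits item, a hypothetical shape: a small bitmask computed when a block is produced lets evaluation pick a specialized loop, for example one with no per-position null checks. The same dispatch pattern extends to the overflow, NaN, and size checks listed above.

```java
final class BlockTraits
{
    // Hypothetical trait flags attached to a block when it is created
    static final int MAY_HAVE_NULLS = 1 << 0;
    static final int MAY_OVERFLOW   = 1 << 1; // values large enough to overflow
    static final int MAY_HAVE_NANS  = 1 << 2; // only meaningful for floating point

    static long sum(long[] values, boolean[] isNull, int traits)
    {
        if ((traits & MAY_HAVE_NULLS) == 0) {
            // Null-free fast path: a tight, branch-free loop the JIT can
            // auto-vectorize
            long sum = 0;
            for (long value : values) {
                sum += value;
            }
            return sum;
        }
        // Generic path with per-position null checks
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            if (!isNull[i]) {
                sum += values[i];
            }
        }
        return sum;
    }
}
```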
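For the Vector API item, a minimal sketch of a batch calling convention that SIMD can exploit: operators take whole arrays instead of single values. This is plain incubator Vector API usage (requires `--add-modules jdk.incubator.vector`), not a proposed Trino interface.

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorSpecies;

final class VectorizedAdd
{
    private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    // result[i] = left[i] + right[i], processed SPECIES.length() lanes at a time
    static void add(long[] left, long[] right, long[] result)
    {
        int i = 0;
        int bound = SPECIES.loopBound(left.length);
        for (; i < bound; i += SPECIES.length()) {
            LongVector a = LongVector.fromArray(SPECIES, left, i);
            LongVector b = LongVector.fromArray(SPECIES, right, i);
            a.add(b).intoArray(result, i);
        }
        // Scalar tail for the remaining positions
        for (; i < left.length; i++) {
            result[i] = left[i] + right[i];
        }
    }
}
```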
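For specialized hash tables at small cardinalities, a sketch of the degenerate but common case: when the group-by key is an integer with a small known range, a flat array indexed by the key replaces the hash table entirely; no hashing, no collision handling. The bound `maxKey` is assumed known, for example from column statistics.

```java
final class SmallCardinalityGroupBy
{
    // sums[key] accumulates values for that key; keys assumed in [0, maxKey)
    static long[] sumByKey(int[] keys, long[] values, int maxKey)
    {
        long[] sums = new long[maxKey];
        for (int i = 0; i < keys.length; i++) {
            sums[keys[i]] += values[i];
        }
        return sums;
    }
}
```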
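For the MemorySegment item, a small standalone example of the Foreign Function & Memory API (finalized in Java 22); this is illustrative usage, not an engine component. It shows off-heap buffers with deterministic deallocation and layout-aware, bounds-checked access.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class OffHeapBuffer
{
    public static void main(String[] args)
    {
        try (Arena arena = Arena.ofConfined()) {
            // 1024 longs of off-heap memory, freed deterministically when the
            // arena closes (no GC pressure from large intermediate buffers)
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_LONG, 1024);
            for (long i = 0; i < 1024; i++) {
                segment.setAtIndex(ValueLayout.JAVA_LONG, i, i * i);
            }
            long value = segment.getAtIndex(ValueLayout.JAVA_LONG, 10);
            System.out.println(value); // prints 100
        }
    }
}
```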
Tasks
- https://github.com/trinodb/trino/pull/14178