
Vortex Roadmap

Open gatesn opened this issue 10 months ago • 5 comments

Bindings + Integrations

  • [x] C++ API to write/scan Arrow #3720
  • [x] Java API to write/scan Arrow
  • [x] Python API to write/scan Arrow
  • [ ] cuDF (+ https://github.com/sirius-db/sirius)
  • [ ] Numpy (#3632)

Arrays, Layouts, and DTypes

  • [x] Make canonical List use a ListView encoding
  • [ ] Add layouts for variable-length types, e.g. ListLayout, VarBinLayout
  • [ ] Variant DType
  • [ ] FixedSizeList + Tensors
  • [ ] FixedBinary DType, provides aligned fixed width binary arrays. Can be used to implement extension arrays for i128, i256, UUID, etc.
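
For context on the ListView item above: in a ListView-style layout (as defined by Apache Arrow), each row stores its own offset and size into a shared child values array, instead of relying on one monotonically increasing offsets array. The sketch below uses plain Python lists to show the difference. It is not Vortex's implementation, and the function names are hypothetical.

```python
def list_get(offsets, values, i):
    """Offsets-only list: row i spans values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def list_view_get(offsets, sizes, values, i):
    """ListView: row i spans values[offsets[i]:offsets[i] + sizes[i]]."""
    start = offsets[i]
    return values[start:start + sizes[i]]


values = [10, 20, 30, 40]

# Offsets-only encoding of [[10, 20], [], [30, 40]].
list_offsets = [0, 2, 2, 4]

# ListView encoding of [[30, 40], [10, 20], [10, 20]]:
# rows may appear out of order and may share the same child elements.
view_offsets = [2, 0, 0]
view_sizes = [2, 2, 2]

list_rows = [list_get(list_offsets, values, i) for i in range(3)]
view_rows = [list_view_get(view_offsets, view_sizes, values, i) for i in range(3)]
```

Because each row carries its own offset and size, rows can be reordered or filtered without rewriting the child values.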

Functionality

  • [x] Add compact ArrayOperation to remove unused data from an array, e.g. dict, varbinview, etc. https://github.com/vortex-data/vortex/issues/1798
  • [ ] Better performance for nested types, incl custom layouts / encodings
  • [ ] Improving I/O subsystem to better detect and support NVMe, EBS, and object store latencies, and Direct IO. #3119
  • [ ] Improve threading model of the scan operator to better support DuckDB- and Polars-style runtimes.
  • [ ] Add a sorted-merge operator #3106
  • [ ] Add a zip operator to scan with auxiliary columns
  • [ ] Multi-file scan (with work-stealing threaded executors)
  • [ ] Continue moving over to VortexSession
  • [ ] Stabilize IPC format, incl shared dictionaries
  • [ ] Add checksum statistic for identifying reused dictionaries
  • [ ] Improve benchmarking tool to emit more useful comparisons of strategies and configurations
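
To illustrate the compact item above, here is a minimal sketch of compacting a dictionary-encoded array: values that no code references any more (for example after a filter or slice) are dropped, and the codes are remapped. It uses plain Python lists rather than Vortex arrays, and `compact_dictionary` and `decode` are hypothetical names, not part of the Vortex API.

```python
def compact_dictionary(codes, values):
    """Drop dictionary values that no code references and remap the codes.

    Returns (new_codes, new_values); decoding either pair yields the same data.
    """
    remap = {}        # old code -> new code, assigned in order of first use
    new_values = []
    new_codes = []
    for code in codes:
        if code not in remap:
            remap[code] = len(new_values)
            new_values.append(values[code])
        new_codes.append(remap[code])
    return new_codes, new_values


def decode(codes, values):
    """Expand a dictionary-encoded array back into its logical values."""
    return [values[c] for c in codes]


# Only two of the five dictionary values are still referenced.
codes = [3, 1, 3, 3, 1]
values = ["a", "b", "c", "d", "e"]

new_codes, new_values = compact_dictionary(codes, values)
```

This sketch orders the new dictionary by first use, which is one possible choice; an implementation could equally preserve the original value order.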

Housekeeping

  • [ ] Move canonical compute implementations to be next to compute fn, rather than next to the array impl
  • [ ] Move filter + take to be ArrayOperations, rather than compute functions. Compute functions then become ScalarFn.
  • [ ] Add ScalarFnExpr to encapsulate all expressions that invoke a ScalarFn (that don't require short-circuiting)
  • [ ] #3781

Research Proposals

Here's a bunch of ideas we have for potential research proposals, likely suitable for bachelor's / master's level. Reach out if you're interested in picking these up!

  • pg_vortex - a Postgres extension to provide columnar scans over Postgres block storage using Vortex layouts.
  • I/O Subsystems - experiment with responsive coalescing by adjusting I/O strategy based on observed throughput and latencies. Alternatively, avoid coalescing entirely and perform object store I/O with sticky connections.
  • Additional FastLanes codecs - add support for dictionary ~and run-length~ fastlanes encoding
  • Variant DTypes - add support for JSON-like variant data.
  • Embedded indexes - add layouts that perform pruning based on index structures, e.g. inverted indices, bloom filters, finite state transducers
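
As a rough illustration of the I/O Subsystems idea above, the sketch below coalesces byte-range reads and derives the coalescing threshold from observed latency and throughput. The heuristic in `choose_max_gap` and all the numbers are assumptions made for illustration, not how Vortex currently behaves.

```python
def coalesce(ranges, max_gap):
    """Merge half-open (start, end) byte ranges whose gap is at most max_gap.

    A larger max_gap means fewer, larger reads, at the cost of fetching some
    bytes that are never used.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def choose_max_gap(latency_us, throughput_mb_per_s):
    """Hypothetical heuristic: bridge a gap when reading its bytes takes no
    longer than issuing one more request.

    1 MB/s equals 1 byte per microsecond, so the product is a byte count.
    """
    return latency_us * throughput_mb_per_s


requests = [(0, 4096), (8192, 12288), (1_000_000, 1_004_096)]

# Assumed NVMe-like observations: 100 us latency, 2000 MB/s throughput.
local_gap = choose_max_gap(100, 2000)
# Assumed object-store-like observations: 50 ms latency, 100 MB/s throughput.
remote_gap = choose_max_gap(50_000, 100)

local_reads = coalesce(requests, local_gap)
remote_reads = coalesce(requests, remote_gap)
```

With these assumed numbers, the low-latency device keeps the distant range as a separate read, while the high-latency store merges everything into one request. A responsive strategy would recompute the threshold as measurements change.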

gatesn avatar Jan 29 '25 12:01 gatesn

We need to add:

  • Remove the filter_evaluation from (LayoutReaders)

joseph-isaacs avatar Jul 07 '25 11:07 joseph-isaacs

Interval types? Are they just an extension dtype over different kinds of ints? For completeness it would be nice to have a u128 primitive array, though all you really need is a fixed-size list. The idea for u128 is that you could potentially optimise the handling of them, but that’s speculative.

robert3005 avatar Jul 10 '25 17:07 robert3005

We need to add:

  • Remove the filter_evaluation from (LayoutReaders)

Any detail for removing filter_evaluation?

Hzc492 avatar Jul 15 '25 04:07 Hzc492

There was a discussion we had around whether filter evaluation could be replaced (merged) with projection evaluation. Currently they're subtly different.

Filter evaluation is (m1: Mask) -> m2: Mask, where len(m1) == len(m2) and true_count(m2) <= true_count(m1).

In other words, it intersects the given mask to refine it.

Projection evaluation is (m1: Mask) -> Array, where len(Array) == true_count(m1).

In other words, it selects the rows where the mask is true.
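
The two signatures can be sketched with plain Python lists standing in for masks and arrays. This is only an illustration of the shapes involved, not the LayoutReader API; `filter_eval` and `project_eval` are hypothetical names, and nulls are ignored.

```python
def filter_eval(mask, column, predicate):
    """Filter evaluation: Mask -> Mask of the same length.

    Only rows already selected by `mask` are tested, so the result can only
    clear bits, never set new ones.
    """
    return [m and predicate(v) for m, v in zip(mask, column)]


def project_eval(mask, column, expr):
    """Projection evaluation: Mask -> Array with one element per selected row."""
    return [expr(v) for m, v in zip(mask, column) if m]


column = [5, 12, 7, 30, 1]
m1 = [True, True, False, True, True]

m2 = filter_eval(m1, column, lambda v: v > 6)
projected = project_eval(m1, column, lambda v: v * 2)
```

Here `m2` has the same length as `m1` with some bits cleared, while `projected` has exactly as many elements as `m1` has true bits.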

The difficulty comes when the user provides us with a filter expression with non-trivial null behaviour. Some filter evaluation implementations even delegate to projection internally to handle nulls.

I don't think we should change the current behaviour, but we have some other experimental ideas to improve scan performance such as removing the idea of splits and using Rust streams instead. But nothing immediately on the roadmap.

gatesn avatar Jul 15 '25 06:07 gatesn

Python support for Windows, please. So far it is only available on macOS and Linux.

vkingnew avatar Sep 23 '25 16:09 vkingnew