vortex
vortex copied to clipboard
Sorting capabilities on arrays
Sorting data is obviously a common use case, but these are a few features that I think would be useful:
- Sorting a single array, for some types it could even happen without moving data
- Lexical sort using multiple columns
- Sorting to indices which gives us an array to use with
take, and which passing around or holding in memory is much cheaper. - Sorting with a limit - only returning
limititems - Sorting can be in both ascending and descending orders.
References
- Arrow's sort module
- Foundationdb's multi-value tuple serialization spec
I assume Vortex can do sorting of some of its encoded/compressed data faster than the sort once it's Arrow data? If so you may be interested in this change: https://github.com/apache/datafusion/pull/17337. This will eventually allow a table scan to say "I can provide sorted data" or at least for the scan to produce sorted data (so that a smart sort algorithm upstream can use insertion sort or something) which practically gives a format like Vortex the ability to do it's own sorting.