Antoine Pitrou
Antoine Pitrou
I'm glad this is in! Congrats!
I didn't look at this as promised. However, I also noticed a potential problem with the current generator usage in the CSV reader that needs to be investigated: https://github.com/apache/arrow/issues/14792
By the way, can you also: 1) rebase/merge from master 2) add documentation for the new class in https://github.com/apache/arrow/blob/master/docs/source/cpp/json.rst and https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst#line-separated-json ?
@NoahFournier Sorry. The project is lacking review bandwidth at the moment, so we have to prioritize work and this might unfortunately take some time.
> Are you saying we should get those done first and measure it before we think about merging this in case it turns out to be slower than the current...
For more context, the original Table/Batch sort was based on something similar to NestedValuesComparator, and I significantly improved its performance by switching to per-column sorting for cases with few keys....
I disagree about introducing non-trivial C++ code that is redundant with existing code _and_ probably sub-performant.
To be clear, the current multi-column sorting facilities don't handle recursive nesting, only the top level of nesting. To handle recursive nesting, you need to flatten the columns. This can...
But I agree with quickly benchmarking the approach in this PR compared to the existing facility. It should be easy: 1) generate a RecordBatch (respectively Table); 2) convert it to...
I have two issues here: 1. The API is ugly. 2. One cannot register different implementations for different input types. This might be an annoying limitation (and/or a performance issue...