# Tracking Issue: migrate Arrow/Parquet to the official implementation
## What type of enhancement is this?

Tech debt reduction
## What does the enhancement do?

As discussed in https://github.com/GreptimeTeam/greptimedb/discussions/388, we decided to migrate the arrow/parquet implementation to the official version. This issue tracks the progress of this work.
## Implementation challenges

### Major API changes
#### DataFusion

- `ExecutionPlan::execute` changed to a sync method (https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute): `fn execute(&self, partition: usize, context: Arc<TaskContext>) -> Result<SendableRecordBatchStream>`
- new interfaces of `ExecutionPlan`:
  - `fn output_ordering(&self) -> Option<&[PhysicalSortExpr]>`
  - `fn statistics(&self) -> Statistics`
- new `TaskContext` (wrapper over `RuntimeEnv`, with more functionality): https://docs.rs/datafusion/latest/datafusion/execution/context/struct.TaskContext.html
- `TableProvider` is separated into `TableProvider` (physical stage) and `TableSource` (logical stage): https://docs.rs/datafusion-expr/latest/datafusion_expr/trait.TableSource.html
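To make the new calling convention concrete, here is a minimal hypothetical driver (the `run` function and its body are made up for illustration, assuming DataFusion 14's `SessionContext::task_ctx()` and `physical_plan::collect`): `execute` itself is now synchronous and returns a stream, and only polling that stream is async.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::{collect, ExecutionPlan};
use datafusion::prelude::SessionContext;

// Hypothetical driver: how callers invoke a physical plan after the
// migration. `execute` is sync and returns a stream; `collect` polls
// that stream to completion.
async fn run(plan: Arc<dyn ExecutionPlan>) -> Result<()> {
    let ctx = SessionContext::new();
    // TaskContext (a wrapper over RuntimeEnv) is now what `execute` takes
    let task_ctx = ctx.task_ctx();
    let batches = collect(plan, task_ctx).await?;
    println!("collected {} record batches", batches.len());
    Ok(())
}
```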
#### Arrow

- `Array` trait
  - apache: https://docs.rs/arrow/latest/arrow/array/trait.Array.html
  - arrow2 (v0.10): https://docs.rs/arrow2/0.10.1/arrow2/array/trait.Array.html
  - the major difference is around validity: `fn validity(&self) -> Option<&Bitmap>` and `fn with_validity(&self, validity: Option<Bitmap>) -> Box<dyn Array>` (see the sketch below)
- `arrow-format` (https://github.com/DataEngineeringLabs/arrow-format) also needs to be migrated to Apache Arrow
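On the validity difference, a small sketch of the official crate's side: arrow2 hands out the null mask as a `Bitmap`, while the official `Array` trait answers the same questions through per-index accessors.

```rust
use arrow::array::{Array, Int32Array};

fn main() {
    let array = Int32Array::from(vec![Some(1), None, Some(3)]);
    // arrow2 exposed the null mask via `validity() -> Option<&Bitmap>`;
    // with the official crate, validity checks go through the Array trait.
    assert!(array.is_valid(0));
    assert!(array.is_null(1));
    assert_eq!(array.null_count(), 1);
}
```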
#### Parquet

- Mainly used in `ParquetWriter` (read/write SST) and statistics (predicate). A write-path sketch follows below.
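For reference, writing a `RecordBatch` through the official crate's `parquet::arrow::ArrowWriter` looks roughly like this. This is illustration only, not GreptimeDB's actual `ParquetWriter`; the schema and file path are made up.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Made-up single-column schema for illustration
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let file = File::create("/tmp/example.parquet")?;
    // `None` uses the default writer properties
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?; // finalizes the footer and returns file metadata
    Ok(())
}
```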
## Works before migration

- [x] Change `ExecutionPlan::execute` to a sync method @waynexia #573
- [x] ~~Specify `output_ordering` for our `ExecutionPlan` impl~~
- [x] Adapt `TableSource` @v0y4g3r
- [x] (TBD: `TaskContext`)
- [x] ~~add proxy method like `get_validity` for `ArrayRef`~~
- [x] (TBD: parquet part)
- [x] use `datatypes::arrow` instead of `arrow` directly @evenyag #571 (sketched below)
- [x] eliminate the usage of `MutableBitmap` @evenyag #610
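The `datatypes::arrow` item above is a facade re-export. A toy sketch of the pattern, where the `datatypes` module stands in for the real crate: downstream code imports arrow through one place, so swapping the underlying implementation touches a single dependency.

```rust
// Toy stand-in for the datatypes crate: re-export arrow wholesale so
// downstream code never names the underlying crate directly.
mod datatypes {
    pub use arrow;
}

use crate::datatypes::arrow::array::Int32Array;

fn main() {
    let a = Int32Array::from(vec![1, 2, 3]);
    assert_eq!(a.len(), 3);
}
```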
## Migrating

Working on branch `replace-arrow2`.

Target versions:

- DataFusion: `14.0.0`
- Arrow: `26.0.0`
- Parquet: `26.0.0`
### Migrating datatypes

Branch `datatypes2` tracks the migration of our vectors to the official arrow:
- [x] PrimitiveVector @evenyag #633
- [x] DateVector @evenyag #651
- [x] DateTimeVector @v0y4g3r #661
- [x] TimestampVector @v0y4g3r #686
- [x] StringVector @waynexia #659
- [x] ConstantVector @waynexia #680
- [x] NullVector @evenyag #658
- [x] ListVector @evenyag #681
- [x] Validity @evenyag #684
- [x] ColumnSchema and Schema @evenyag #695
### TODO

- [ ] Implement `VectorOp::cast()` and refactor `create_current_timestamp_vector()`
- [ ] Refactor ListValue and ListValueRef
- [ ] Better distinguish logical type and physical type
- [ ] Avoid using match to handle timestamp data type
- [ ] #754
- [ ] Document that describes the design
- [ ] Remove version from Schema #349
- [x] Refactor RecordBatch and remove `arrow_array_get()`
- [ ] Refactor Timestamp, maybe put all time units as flat enum variants
- [ ] Refactor ConcreteDataType, move `is_xxx` to properties
- [ ] Test JSON serialization output of values, datatypes, schema, column schema...
- [ ] Refactor usage of `Helper::static_cast()`, avoid checking whether the vector is const in user code
- [ ] More tests for parquet reader
- [ ] Consider using ordered float as the wrapper type for float vectors (see the sketch after this list)
- [ ] Move `query/tests` to `query/src`
- [ ] Move `script/python/builtins/mod.rs` to `script/python/builtins.rs`, also check other mods
- [ ] Refactor script mod to use our `RecordBatch`/`Vector`/`Schema`
- [ ] Wrap `push_value_ref().unwrap()` for MutableVector
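On the ordered-float item, a minimal sketch of the idea using the `ordered-float` crate (the crate choice is an assumption; the item only says "consider"): wrapping floats restores `Ord`, `Eq`, and `Hash`, which `f64` lacks because of NaN.

```rust
use ordered_float::OrderedFloat;

fn main() {
    let mut values = vec![
        OrderedFloat(2.5_f64),
        OrderedFloat(f64::NAN),
        OrderedFloat(0.5),
    ];
    // `sort` requires `Ord`; plain f64 only implements PartialOrd
    values.sort();
    // OrderedFloat defines a total order in which NaN is the greatest value
    assert_eq!(values[0], OrderedFloat(0.5));
    assert!(values[2].is_nan());
}
```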
### Migrating functions
- [x] diff #715
- [x] argmax @evenyag #716
- [x] argmin @evenyag
- [x] mean @waynexia #717
- [x] median @evenyag
- [x] percentile @evenyag
- [x] polyval @waynexia #717
- [x] scipy_stats_norm_cdf @evenyag
- [x] scipy_stats_norm_pdf @evenyag
- [x] pow @waynexia #717
- [x] rate @evenyag
- [x] clip @evenyag
- [x] interp @waynexia #717
### TODO

- [x] test `test_clip_fn_*` may fail with SIGSEGV #734
What about `Field` and `Schema`? I remember that arrow2 and arrow have different APIs.
> add proxy method like `get_validity` for `ArrayRef`

Only a few places use `get_validity()`, I think we could skip this step.
BTW, we also use methods in `arrow::compute`:
https://github.com/GreptimeTeam/greptimedb/blob/6d762aa9dcad7f4edb9e1019688e6eb25b14d7d8/src/script/src/python/vector.rs#L115-L122

They might have different semantics:

- [arrow2 CastOptions](https://docs.rs/arrow2/latest/arrow2/compute/cast/struct.CastOptions.html)
- [arrow CastOptions](https://docs.rs/arrow/latest/arrow/compute/struct.CastOptions.html)
Yes, so many API changes. For this particular case, I think we can set `safe` to `true` in the new option.
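A small sketch of what that looks like against arrow 26 (the example values are made up): with `safe: true`, an out-of-range cast yields null instead of failing the whole cast.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::{cast_with_options, CastOptions};
use arrow::datatypes::DataType;

fn main() {
    let array: ArrayRef = Arc::new(Int64Array::from(vec![1, i64::MAX]));
    // safe = true: a value that does not fit the target type becomes null
    // instead of turning the whole cast into an error
    let options = CastOptions { safe: true };
    let casted = cast_with_options(&array, &DataType::Int8, &options).unwrap();
    assert!(casted.is_valid(0));
    assert!(casted.is_null(1)); // i64::MAX does not fit into Int8
}
```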
Fortunately (or maybe sadly 🥲) it looks like we don't have this kind of boundary-case test. I can almost see a future where we've put lots of care into reviewing the change (I hope so 🤪) but still miss some critical difference that takes us a day to track down.
We may also need to refactor `Vector` to separate a vector's logical type from its internal representation, as discussed in #203.
> We may also need to refactor `Vector` to separate a vector's logical type from its internal representation, as discussed in #203.

Looks like #203 suggests a non-trivial change. We can dive into it after this migration.
I'm working on the new vectors based on the official arrow in branch `datatypes2`. The boolean and binary vectors are implemented; the next step is porting the primitive vector. @waynexia @v0y4g3r
> test `test_clip_fn_*` may fail with SIGSEGV
I reproduced this bug in another repo and found out that it is highly related to the usage of the `with_match_primitive_type_id` macro.

The AddressSanitizer also says that the stack overflowed when running the tests:
```
=================================================================
==71205==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x00016d433f60 at pc 0x000102ca8210 bp 0x00016d3caf90 sp 0x00016d3caf88
WRITE of size 8 at 0x00016d433f60 thread T2
    #0 0x102ca820c in debug_datatypes::macros::tests::eval_i64::h9032765d623f5419 macros.rs:178
```
The `define_eval` macro expands to about 11 * 11 * 11 = 1331 match arms (11 branches per level: 10 primitive types plus an unreachable arm, nested three levels deep). The overflow should be caused by rust-lang/rust#34283. Thanks for providing this helpful hint @discord9.
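To make the blow-up concrete, here is a much-simplified stand-in for this kind of type-dispatch macro (it is not the actual `with_match_primitive_type_id`): nesting it once per operand multiplies the arms, so three levels of 11 branches yield 1331 monomorphized arms in a single function.

```rust
// Simplified stand-in: dispatch a runtime type id to a compile-time type.
// The real macro covers ~11 branches; nesting it for each operand of a
// binary or ternary kernel multiplies the number of expanded match arms.
macro_rules! with_match_type {
    ($id:expr, | $t:ident | $body:expr) => {
        match $id {
            0 => { type $t = i32; $body }
            1 => { type $t = i64; $body }
            _ => { type $t = f64; $body }
        }
    };
}

// 3 x 3 = 9 arms here; 11 x 11 x 11 = 1331 in the real eval function
fn eval(a: u8, b: u8) -> usize {
    with_match_type!(a, |A| with_match_type!(b, |B| {
        std::mem::size_of::<A>() + std::mem::size_of::<B>()
    }))
}

fn main() {
    assert_eq!(eval(0, 1), 4 + 8);
}
```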
arrow only provides a generic version of the arithmetic operations, but arrow2 supports passing the scalar dynamically.

arrow:

```rust
pub fn add_scalar_dyn<T>(array: &dyn Array, scalar: T::Native) -> Result<ArrayRef>
where
    T: ArrowNumericType,
    T::Native: ArrowNativeTypeOp,
{}
```

arrow2:

```rust
pub fn add_scalar(lhs: &dyn Array, rhs: &dyn Scalar) -> Box<dyn Array> {}
```
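A usage sketch of the arrow side, assuming the `add_scalar_dyn` signature quoted above: the generic parameter fixes the concrete type at compile time, so mimicking arrow2's dynamic `add_scalar` means the caller must dispatch on the runtime `DataType` itself (e.g. via a type-id match like the macros above).

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::compute::kernels::arithmetic::add_scalar_dyn;
use arrow::datatypes::Int64Type;

fn main() {
    let array: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
    // The type parameter is chosen at compile time; runtime dispatch over
    // the array's DataType stays on the caller's side.
    let result = add_scalar_dyn::<Int64Type>(array.as_ref(), 10).unwrap();
    assert_eq!(result.len(), 3);
}
```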
Is this error related to this issue? This function is not necessary, but it is useful in time series queries.

Sorry, I didn't see that there is a `now()` function available, which is an alternative.
> This function is not necessary, but it is useful in time series queries

Unfortunately, we don't support the `Time64` data type for now. Note that the `Timestamp` type is different from the `Time` type.
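For clarity, a tiny sketch of that distinction in arrow's type system (illustration only):

```rust
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // Timestamp: an absolute instant measured from the Unix epoch,
    // optionally tagged with a timezone
    let ts = DataType::Timestamp(TimeUnit::Millisecond, None);
    // Time64: a time of day only, with no date component
    let time = DataType::Time64(TimeUnit::Microsecond);
    println!("{:?} vs {:?}", ts, time);
}
```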
What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r
> What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r

I think we can close this tracking issue. The unresolved TODOs mainly relate to data type refactoring; it's better to open another issue for that topic.