
Tracking Issue: migrate Arrow/Parquet to the official implementation.

Open waynexia opened this issue 2 years ago • 8 comments

What type of enhancement is this?

Tech debt reduction

What does the enhancement do?

As discussed in https://github.com/GreptimeTeam/greptimedb/discussions/388, we decided to migrate the arrow/parquet implementation to the official version. This issue tracks the progress of this work.

Implementation challenges

Major API changes

  • DataFusion
    • ExecutionPlan::execute changed to a sync method https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.execute

          fn execute(
              &self,
              partition: usize,
              context: Arc<TaskContext>
          ) -> Result<SendableRecordBatchStream>
      
    • new interfaces of ExecutionPlan
      • fn output_ordering(&self) -> Option<&[PhysicalSortExpr]>
      • fn statistics(&self) -> Statistics
        
    • new TaskContext https://docs.rs/datafusion/latest/datafusion/execution/context/struct.TaskContext.html
      • (wrapper over RuntimeEnv, with more functionality)
    • TableProvider is split into TableProvider (physical stage) and TableSource (logical stage) https://docs.rs/datafusion-expr/latest/datafusion_expr/trait.TableSource.html
  • Arrow
    • Array trait
      • apache: https://docs.rs/arrow/latest/arrow/array/trait.Array.html
      • arrow2 (v0.10): https://docs.rs/arrow2/0.10.1/arrow2/array/trait.Array.html
      • the major difference is around validity:

            fn validity(&self) -> Option<&Bitmap>

            fn with_validity(&self, validity: Option<Bitmap>) -> Box<dyn Array>
        
    • arrow-format https://github.com/DataEngineeringLabs/arrow-format also needs to be migrated to Apache Arrow
  • Parquet
    • Mainly used in ParquetWriter (read/write sst) and statistics (predicate).
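Among the Arrow API changes above, the validity accessors are the ones most call sites touch. A minimal std-only sketch of the two shapes; the `Bitmap` and `Int64Array` types below are simplified stand-ins for illustration, not the real arrow/arrow2 types:

```rust
// Simplified stand-ins: arrow2 hands you the validity bitmap directly
// (`fn validity(&self) -> Option<&Bitmap>`, where None means "no nulls"),
// while apache arrow favors per-element accessors like `is_null` and
// `null_count`. Code written against one shape needs a small shim for
// the other.
struct Bitmap(Vec<bool>); // true = valid slot; a toy bitmap

struct Int64Array {
    values: Vec<i64>,
    validity: Option<Bitmap>, // None means every slot is valid
}

impl Int64Array {
    // arrow2-flavored accessor: expose the whole bitmap.
    fn validity(&self) -> Option<&Bitmap> {
        self.validity.as_ref()
    }

    // apache-arrow-flavored accessors, derivable from the bitmap.
    fn is_null(&self, i: usize) -> bool {
        match &self.validity {
            Some(bitmap) => !bitmap.0[i],
            None => false,
        }
    }

    fn null_count(&self) -> usize {
        match &self.validity {
            Some(bitmap) => bitmap.0.iter().filter(|valid| !**valid).count(),
            None => 0,
        }
    }
}

fn main() {
    let arr = Int64Array {
        values: vec![1, 2, 3],
        validity: Some(Bitmap(vec![true, false, true])),
    };
    assert_eq!(arr.values[0], 1);
    assert_eq!(arr.null_count(), 1);
    assert!(arr.is_null(1));
    assert!(arr.validity().is_some());
}
```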

Works before migration

  • [x] Change ExecutionPlan::execute to sync method @waynexia #573
  • [x] ~Specify output_ordering for our ExecutionPlan impl~
  • [x] Adapt TableSource @v0y4g3r
  • [x] (TBD: TaskContext)
  • [x] ~add proxy method like get_validity for ArrayRef~
  • [x] (TBD: parquet part)
  • [x] use datatypes::arrow instead of arrow directly @evenyag #571
  • [x] eliminate the usage of MutableBitmap @evenyag #610

Migrating

Working on branch replace-arrow2

Target versions:

  • DataFusion: 14.0.0
  • Arrow: 26.0.0
  • Parquet: 26.0.0

Migrating datatypes

On branch datatypes2 we are migrating our vectors to arrow

  • [x] PrimitiveVector @evenyag #633
  • [x] DateVector @evenyag #651
  • [x] DateTimeVector @v0y4g3r #661
  • [x] TimestampVector @v0y4g3r #686
  • [x] StringVector @waynexia #659
  • [x] ConstantVector @waynexia #680
  • [x] NullVector @evenyag #658
  • [x] ListVector @evenyag #681
  • [x] Validity @evenyag #684
  • [x] ColumnSchema and Schema @evenyag #695

TODO

  • [ ] Implement VectorOp::cast() and refactor create_current_timestamp_vector()
  • [ ] Refactor ListValue and ListValueRef
  • [ ] Better distinguish logical types from physical types
  • [ ] Avoid using match to handle timestamp data type
  • [ ] #754
  • [ ] Document that describes the design
  • [ ] Remove version from Schema #349
  • [x] Refactor RecordBatch and remove arrow_array_get()
  • [ ] Refactor Timestamp, maybe put all time units as flat enum variants
  • [ ] Refactor ConcreteDataType, move is_xxx to properties
  • [ ] Test JSON serialization output of values, datatypes, schema, column schema...
  • [ ] Refactor usage of Helper::static_cast(), avoid checking whether the vector is const in user code
  • [ ] More tests for parquet reader
  • [ ] Consider using ordered float as wrapper type for float vectors.
  • [ ] Move query/tests to query/src
  • [ ] Move script/python/builtins/mod.rs to script/python/builtins.rs; also check other mods
  • [ ] Refactor script mod to use our RecordBatch/Vector/Schema
  • [ ] Wrap push_value_ref().unwrap() for MutableVector
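For the ordered-float item in the list above: `f64` is not `Ord` because of NaN, so a newtype wrapper has to supply the total order. A minimal sketch (the `OrdF64` name is hypothetical, not an existing GreptimeDB type) built on the standard library's `f64::total_cmp`:

```rust
use std::cmp::Ordering;

// Hypothetical ordered-float wrapper: all comparisons delegate to
// f64::total_cmp, which defines a total order over every float value
// (positive NaN sorts after +infinity, so NaN-safe sorting works).
#[derive(Debug, Clone, Copy)]
struct OrdF64(f64);

impl PartialEq for OrdF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.total_cmp(&other.0) == Ordering::Equal
    }
}

impl Eq for OrdF64 {}

impl PartialOrd for OrdF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for OrdF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

fn main() {
    // Sorting works because OrdF64 is Ord, unlike raw f64.
    let mut v = vec![OrdF64(2.0), OrdF64(f64::NAN), OrdF64(-1.0)];
    v.sort();
    assert_eq!(v[0], OrdF64(-1.0));
    assert_eq!(v[1], OrdF64(2.0));
    assert!(v[2].0.is_nan());
}
```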

Migrating functions

  • [x] diff #715
  • [x] argmax @evenyag #716
  • [x] argmin @evenyag
  • [x] mean @waynexia #717
  • [x] median @evenyag
  • [x] percentile @evenyag
  • [x] polyval @waynexia #717
  • [x] scipy_stats_norm_cdf @evenyag
  • [x] scipy_stats_norm_pdf @evenyag
  • [x] pow @waynexia #717
  • [x] rate @evenyag
  • [x] clip @evenyag
  • [x] interp @waynexia #717

TODO

  • [x] test test_clip_fn_* may fail with SIGSEGV #734

waynexia avatar Nov 17 '22 06:11 waynexia

What about Field and Schema? I remember that arrow2 and arrow have different APIs.

evenyag avatar Nov 17 '22 09:11 evenyag

add proxy method like get_validity for ArrayRef

Only a few places use get_validity(), I think we could skip this step.

BTW, we also use methods in arrow::compute https://github.com/GreptimeTeam/greptimedb/blob/6d762aa9dcad7f4edb9e1019688e6eb25b14d7d8/src/script/src/python/vector.rs#L115-L122

They might have different semantics

evenyag avatar Nov 18 '22 06:11 evenyag

They might have different semantics

* [arrow2 CastOptions](https://docs.rs/arrow2/latest/arrow2/compute/cast/struct.CastOptions.html)

* [arrow CastOptions](https://docs.rs/arrow/latest/arrow/compute/struct.CastOptions.html)

Yes, so many API changes. For this particular case, I think we can set safe to true in the new option.

Gladly (or maybe sadly 🥲) it looks like we don't have this kind of boundary-case test. I can almost see the future where we've taken great care reviewing the change (I hope so 🤪) but still miss some critical difference that takes us a day to track down.
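On the `safe` flag: in arrow's CastOptions, `safe: true` means values that can't be represented in the target type become null instead of failing the whole cast. A std-only sketch of the two behaviors (`cast_safe` and `cast_checked` are illustrative helpers, not arrow APIs):

```rust
// Illustrative sketch (std-only): the difference between a "safe" cast,
// which turns out-of-range values into None (null), and a checked cast,
// which fails the whole operation. The real arrow CastOptions behaves
// analogously at the array level.
fn cast_safe(values: &[i64]) -> Vec<Option<i32>> {
    // safe = true: out-of-range values become null instead of erroring
    values.iter().map(|&v| i32::try_from(v).ok()).collect()
}

fn cast_checked(values: &[i64]) -> Result<Vec<i32>, String> {
    // safe = false: the first out-of-range value fails the whole cast
    values
        .iter()
        .map(|&v| i32::try_from(v).map_err(|_| format!("value {v} overflows i32")))
        .collect()
}

fn main() {
    let input = [1_i64, i64::MAX, -5];
    assert_eq!(cast_safe(&input), vec![Some(1), None, Some(-5)]);
    assert!(cast_checked(&input).is_err());
    assert_eq!(cast_checked(&[1, 2]).unwrap(), vec![1, 2]);
}
```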

waynexia avatar Nov 18 '22 14:11 waynexia

We may also need to refactor Vector to separate a vector's logical type from its internal representation, as discussed in #203

v0y4g3r avatar Nov 21 '22 06:11 v0y4g3r

We may also need to refactor Vector to separate a vector's logical type from its internal representation, as discussed in #203

Looks like #203 is suggesting a non-trivial change. We can dive into it after this migration.

waynexia avatar Nov 22 '22 08:11 waynexia

I'm working on the new vectors based on the official arrow in branch datatypes2. I have implemented the boolean and binary vectors; the next step is porting the primitive vector. @waynexia @v0y4g3r

evenyag avatar Nov 23 '22 06:11 evenyag

  • test test_clip_fn_* may fail with SIGSEGV

I reproduced this bug in another repo and found that it is highly related to the usage of the with_match_primitive_type_id macro.

The AddressSanitizer also reports that the stack overflowed when running the tests.

=================================================================
==71205==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x00016d433f60 at pc 0x000102ca8210 bp 0x00016d3caf90 sp 0x00016d3caf88
WRITE of size 8 at 0x00016d433f60 thread T2
    #0 0x102ca820c in debug_datatypes::macros::tests::eval_i64::h9032765d623f5419 macros.rs:178

The define_eval macro expands to about 11 * 11 * 11 = 1331 match arms (there are 10 primitive types plus an unreachable branch). The overflow should be caused by rust-lang/rust#34283. Thanks for providing this helpful hint @discord9.
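A std-only sketch of why the expansion blows up. The macro below imitates the shape of with_match_primitive_type_id with only three types; it is an illustration, not the real GreptimeDB macro:

```rust
// Illustrative sketch: a `with_match_primitive_type_id`-style macro
// expands its body once per supported type. Nesting it, as a ternary
// kernel must to cover every type combination, multiplies the number of
// generated match arms: with 11 arms per level, three levels yield
// 11^3 = 1331 arms, each with its own locals, which is what can
// overflow the stack in unoptimized builds (rust-lang/rust#34283).
enum TypeId {
    Int32,
    Int64,
    Float64,
}

macro_rules! with_match_type_id {
    ($id:expr, |$t:ident| $body:block) => {
        match $id {
            TypeId::Int32 => { type $t = i32; $body }
            TypeId::Int64 => { type $t = i64; $body }
            TypeId::Float64 => { type $t = f64; $body }
        }
    };
}

// Two nested levels already produce 3 * 3 = 9 arms after expansion.
fn widths(lhs: TypeId, rhs: TypeId) -> usize {
    with_match_type_id!(lhs, |L| {
        with_match_type_id!(rhs, |R| {
            std::mem::size_of::<L>() + std::mem::size_of::<R>()
        })
    })
}

fn main() {
    assert_eq!(widths(TypeId::Int32, TypeId::Float64), 4 + 8);
    assert_eq!(widths(TypeId::Int64, TypeId::Int64), 16);
}
```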

evenyag avatar Dec 09 '22 08:12 evenyag

arrow only provides a generic version of the arithmetic operations, while arrow2 supports passing the scalar dynamically.

arrow:

pub fn add_scalar_dyn<T>(array: &dyn Array, scalar: T::Native) -> Result<ArrayRef>
where
    T: ArrowNumericType,
    T::Native: ArrowNativeTypeOp,
{}

arrow2:

pub fn add_scalar(lhs: &dyn Array, rhs: &dyn Scalar) -> Box<dyn Array> {}
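A std-only sketch of how the two styles relate: a dynamic entry point in arrow2's style can be layered over a generic kernel in arrow's style by dispatching on the runtime type once. `DynArray` and both function names here are hypothetical illustrations, not the actual arrow/arrow2 APIs:

```rust
// Hypothetical sketch: arrow exposes a generic kernel (monomorphized per
// type), while arrow2's API takes a dynamically typed scalar. A dynamic
// wrapper can be recovered by matching on the runtime type once and then
// calling the generic kernel, similar to what a migration shim would do.
#[derive(Debug, PartialEq)]
enum DynArray {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

// The generic, arrow-style kernel: statically typed element-wise add.
fn add_scalar_generic<T>(values: &[T], scalar: T) -> Vec<T>
where
    T: Copy + std::ops::Add<Output = T>,
{
    values.iter().map(|&v| v + scalar).collect()
}

// The arrow2-style dynamic entry point, built on top of the generic one.
fn add_scalar_dynamic(array: &DynArray, scalar: f64) -> DynArray {
    match array {
        // Truncating cast for the integer branch; a real kernel would
        // instead require a scalar of the matching type.
        DynArray::Int64(v) => DynArray::Int64(add_scalar_generic(v, scalar as i64)),
        DynArray::Float64(v) => DynArray::Float64(add_scalar_generic(v, scalar)),
    }
}

fn main() {
    let arr = DynArray::Int64(vec![1, 2, 3]);
    assert_eq!(add_scalar_dynamic(&arr, 10.0), DynArray::Int64(vec![11, 12, 13]));
}
```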

evenyag avatar Dec 14 '22 07:12 evenyag

Is this error related to this issue? This function is not necessary, but it is useful in time series queries.

[image]

DiamondMofeng avatar Feb 03 '23 03:02 DiamondMofeng

Sorry, I didn't see that there is a now() function available, which is an alternative. [image]

DiamondMofeng avatar Feb 03 '23 03:02 DiamondMofeng

This function is not necessary, but it is useful in time series queries

Unfortunately, we don't support the Time64 data type now. Note that the Timestamp type is different from the Time type.

evenyag avatar Feb 03 '23 06:02 evenyag

What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r

killme2008 avatar Jun 20 '23 10:06 killme2008

What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r

I think it's sufficient to close this tracking issue. The unresolved todos mainly relate to data type refactoring; it's better to open another issue for that topic.

v0y4g3r avatar Jun 20 '23 13:06 v0y4g3r