[draft] Add `LogicalType`, try to support user-defined types

Open yukkit opened this issue 1 year ago • 8 comments

Which issue does this PR close?

Closes #7923.

This pull request is an experimental demo for validating the feasibility of logical types.

Rationale for this change

What changes are included in this PR?

Features

  • Create User-Defined Types (UDTs) through SQL, specifying the field types as UDTs during table creation.
  • Support the use of UDT as a function signature in udf/udaf.
  • Register extension types through the register_data_type function in the SessionContext.
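To illustrate the registration feature, here is a minimal, dependency-free sketch of how a session could map type names to registered types. Everything in it (`ToySessionContext`, the `LogicalType` variants, the `resolve` helper, and the exact signature of `register_data_type`) is a hypothetical stand-in, not the API added by this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a logical type; not the PR's definition.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int64,
    Utf8,
    /// A user-defined type: its name plus the built-in type it is stored as.
    Extension { name: String, storage: Box<LogicalType> },
}

/// Toy session context that only tracks registered types.
#[derive(Default)]
struct ToySessionContext {
    types: HashMap<String, LogicalType>,
}

impl ToySessionContext {
    /// Mirrors the idea of `register_data_type`; the real signature may differ.
    /// Returns the previously registered type for that name, if any.
    fn register_data_type(&mut self, name: &str, ty: LogicalType) -> Option<LogicalType> {
        self.types.insert(name.to_ascii_lowercase(), ty)
    }

    /// Resolve a type name as it might appear in a column definition:
    /// built-in names first, then registered user-defined types.
    fn resolve(&self, name: &str) -> Option<LogicalType> {
        match name.to_ascii_lowercase().as_str() {
            "bigint" => Some(LogicalType::Int64),
            "varchar" => Some(LogicalType::Utf8),
            other => self.types.get(other).cloned(),
        }
    }
}
```

With this shape, a column declared with an unknown type name resolves to `None` until the type has been registered, which is where a planner could report an error.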

New Additions

  • LogicalType struct.
  • ExtensionType trait. Abstraction for extension types.
  • TypeSignature struct. Uniquely identifies a data type.
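As a rough picture of how these three pieces could fit together, the sketch below uses only the standard library. The field layouts, method names, the `PhysicalType` stand-in for arrow's `DataType`, and the `Geometry` example are all assumptions made for illustration. In particular, `LogicalType` is modeled as an enum here for brevity, whereas the PR describes it as a struct.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical: identifies a type by name plus optional parameters.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TypeSignature {
    name: String,
    params: Vec<String>,
}

impl TypeSignature {
    fn new(name: &str) -> Self {
        Self::with_params(name, &[])
    }

    fn with_params(name: &str, params: &[&str]) -> Self {
        Self {
            name: name.to_string(),
            params: params.iter().map(|p| p.to_string()).collect(),
        }
    }
}

/// Stand-in for arrow's `DataType` so the sketch has no dependencies.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalType {
    Int64,
    Utf8,
    Binary,
}

/// Hypothetical shape of the extension-type abstraction.
trait ExtensionType: Debug {
    fn type_signature(&self) -> TypeSignature;
    fn physical_type(&self) -> PhysicalType;
}

/// Modeled as an enum here (an assumption of this sketch).
#[derive(Debug, Clone)]
enum LogicalType {
    Int64,
    Utf8,
    Extension(Arc<dyn ExtensionType>),
}

impl LogicalType {
    fn signature(&self) -> TypeSignature {
        match self {
            LogicalType::Int64 => TypeSignature::new("int64"),
            LogicalType::Utf8 => TypeSignature::new("utf8"),
            LogicalType::Extension(ext) => ext.type_signature(),
        }
    }

    fn physical_type(&self) -> PhysicalType {
        match self {
            LogicalType::Int64 => PhysicalType::Int64,
            LogicalType::Utf8 => PhysicalType::Utf8,
            LogicalType::Extension(ext) => ext.physical_type(),
        }
    }
}

/// Example extension type: a geometry value stored as binary.
#[derive(Debug)]
struct Geometry {
    kind: String,
}

impl ExtensionType for Geometry {
    fn type_signature(&self) -> TypeSignature {
        TypeSignature::with_params("geometry", &[self.kind.as_str()])
    }

    fn physical_type(&self) -> PhysicalType {
        PhysicalType::Binary
    }
}
```

Two geometry types with different `kind` values share the same physical storage but have distinct signatures, which is the property a unique type identifier needs.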

Major Changes

  • Added get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType> function to the ContextProvider trait.
  • In DFSchema, DFField now uses LogicalType instead of wrapping an arrow Field. It retains only data_type, nullable, and metadata, since dict_id and dict_is_ordered are not needed at the logical stage.
  • ExprSchemable and ExprSchema now use LogicalType.
  • AST to logical plan conversion now uses LogicalType.
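The DFField change can be pictured with the following sketch, which uses stand-in types rather than the real arrow or DataFusion ones. `ArrowLikeField`, `LogicalField`, and both type enums are hypothetical. Keeping `name` on the logical field is also an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalType {
    Int64,
    Utf8,
}

/// Stand-in for the PR's `LogicalType`.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int64,
    Utf8,
}

impl From<&PhysicalType> for LogicalType {
    fn from(ty: &PhysicalType) -> Self {
        match ty {
            PhysicalType::Int64 => LogicalType::Int64,
            PhysicalType::Utf8 => LogicalType::Utf8,
        }
    }
}

/// Stand-in for an arrow `Field`, including the dictionary bookkeeping.
#[derive(Debug, Clone)]
struct ArrowLikeField {
    name: String,
    data_type: PhysicalType,
    nullable: bool,
    dict_id: i64,
    dict_is_ordered: bool,
    metadata: HashMap<String, String>,
}

/// Logical-stage field: no dictionary details are carried over.
#[derive(Debug, Clone, PartialEq)]
struct LogicalField {
    name: String,
    data_type: LogicalType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl From<&ArrowLikeField> for LogicalField {
    fn from(f: &ArrowLikeField) -> Self {
        // `dict_id` and `dict_is_ordered` are intentionally dropped here.
        LogicalField {
            name: f.name.clone(),
            data_type: LogicalType::from(&f.data_type),
            nullable: f.nullable,
            metadata: f.metadata.clone(),
        }
    }
}
```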

To Be Implemented

  • TypeCoercionRewriter in the analyzer stage should use logical types, including functions such as comparison_coercion, get_input_types, and get_valid_types.
  • Function signatures for udf/udaf should use TypeSignature instead of the existing DataType, for ease of use in udf/udaf.

To Be Determined

  • Should ScalarValue use LogicalType or arrow DataType?
    • [ ] LogicalType
    • [ ] DataType
  • Should TableSource return DFSchema or arrow Schema?
    • [ ] Schema
    • [ ] DFSchema
  • Conversion between physical types and logical types (in DataFusion, type conversion is achieved through the conversion of DFSchema to Schema; logical plans use DFSchema, physical plans use Schema).
  • Conversion between Schema and DFSchema
    • When to convert Schema to DFSchema?
      • [ ] During the construction of the logical TableScan node, obtain arrow Schema through TableSource/TableProvider and then convert it to DFSchema.
      • [ ] TableSource/TableProvider returns DFSchema instead of Schema.
    • When to convert DFSchema to Schema?
      • [ ] Directly obtain arrow Schema from TableSource in physical planner, no need for conversion.
      • [ ] Convert the DFSchema returned by TableSource to Schema in the physical planner stage.
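For the second option under "When to convert DFSchema to Schema?", a conversion step in the physical planner might look roughly like the sketch below. It is not DataFusion code: `LogicalField`, `PhysicalField`, both type enums, and `to_physical_schema` are hypothetical names, and plain vectors stand in for DFSchema and Schema.

```rust
/// Stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalType {
    Int64,
    Utf8,
    Binary,
}

/// Stand-in for a logical type with one user-defined variant.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int64,
    Utf8,
    /// User-defined type that remembers the physical type it is stored as.
    Extension { name: String, storage: PhysicalType },
}

impl LogicalType {
    /// Lower a logical type to the physical type used at execution time.
    fn to_physical(&self) -> PhysicalType {
        match self {
            LogicalType::Int64 => PhysicalType::Int64,
            LogicalType::Utf8 => PhysicalType::Utf8,
            LogicalType::Extension { storage, .. } => storage.clone(),
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct LogicalField {
    name: String,
    data_type: LogicalType,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct PhysicalField {
    name: String,
    data_type: PhysicalType,
    nullable: bool,
}

/// Convert a logical schema (list of fields) to a physical one, field by field.
fn to_physical_schema(fields: &[LogicalField]) -> Vec<PhysicalField> {
    fields
        .iter()
        .map(|f| PhysicalField {
            name: f.name.clone(),
            data_type: f.data_type.to_physical(),
            nullable: f.nullable,
        })
        .collect()
}
```

The conversion is one-way in this sketch: the extension name is lost when lowering, which is one reason a Schema to DFSchema conversion is harder to define.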

Some Thoughts

  • In this comment, the use case of converting from dyn Array to LineStringArray or MultiPointArray was raised. In my view, if a function is designed specifically for handling LineString data, its signature can be declared as LineString, which ensures that the input data is always of a type that LineStringArray accepts.
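A small sketch of that idea: if the planner checks argument types against a declared signature, a LineString-only function never sees any other input. The types and the `st_length` function name are hypothetical and serve only to illustrate the check.

```rust
/// Hypothetical: a type identified by name only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TypeSignature(String);

impl TypeSignature {
    fn new(name: &str) -> Self {
        TypeSignature(name.to_string())
    }
}

/// Hypothetical function signature: a name plus the exact argument types.
struct FunctionSignature {
    name: String,
    args: Vec<TypeSignature>,
}

impl FunctionSignature {
    /// Check the actual argument types at planning time.
    fn check(&self, actual: &[TypeSignature]) -> Result<(), String> {
        if actual.len() != self.args.len() {
            return Err(format!(
                "{} expects {} argument(s), got {}",
                self.name,
                self.args.len(),
                actual.len()
            ));
        }
        for (i, (want, got)) in self.args.iter().zip(actual).enumerate() {
            if want != got {
                return Err(format!(
                    "{}: argument {} must be {}, got {}",
                    self.name, i, want.0, got.0
                ));
            }
        }
        Ok(())
    }
}

/// Signature of an illustrative function that only accepts LineString.
fn st_length_signature() -> FunctionSignature {
    FunctionSignature {
        name: "st_length".to_string(),
        args: vec![TypeSignature::new("LineString")],
    }
}
```

Because a MultiPoint argument is rejected before execution, the function body could downcast its input to LineStringArray without a runtime type check.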

Are these changes tested?

Are there any user-facing changes?

yukkit avatar Nov 12 '23 08:11 yukkit

The current PR has some unresolved issues that need discussion. Once the team reaches consensus on all of them, I will reorganize the PR accordingly.

yukkit avatar Nov 12 '23 08:11 yukkit

I've mapped out where DataFusion converts between DFSchema and Schema. In theory, there should be no conversion logic from Schema to DFSchema. I've outlined all the modifications below.

DFSchema to Schema

No need to change

DefaultPhysicalPlanner

  • DescribeTable
  • Values -> ValuesExec
  • EmptyRelation -> EmptyExec
  • Unnest -> UnnestExec
  • CopyTo
  • Explain
  • Analyze

To be changed

  • [ ] TableProvider::schema

    • [ ] ViewTable
    • [ ] ListingTable
    • [ ] EmptyTable
    • [ ] MemTable
    • [ ] StreamingTable
  • [ ] DataFrame

    • [x] write_table: replace with DFSchema
    • [ ] cache: build MemTable

Schema to DFSchema (To be changed)

  • [x] LogicalPlanBuilder::insert_into: can directly use DFSchema
  • [x] LogicalPlanBuilder::explain: can directly use DFSchema
  • [x] ConstEvaluator: construct a DFSchema, then convert it to Schema
  • [x] SqlToRel::explain_to_plan: output schema can directly use DFSchema
  • [x] SqlToRel::describe_table_to_plan: output schema can directly use DFSchema
  • [ ] SqlToRel::insert_to_plan: depends on table_source.schema()
  • [ ] SqlToRel::delete_to_plan: depends on table_source.schema()
  • [ ] ListingTable::scan: the schema is used for create_physical_expr

yukkit avatar Nov 13 '23 12:11 yukkit

Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow

alamb avatar Nov 13 '23 18:11 alamb

What's the status of this pr? This should be a very useful feature.

lewiszlw avatar Jan 10 '24 08:01 lewiszlw

I think this PR is stalled and I don't have any update

alamb avatar Jan 11 '24 19:01 alamb

Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR.

yukkit avatar Mar 09 '24 07:03 yukkit

No worries at all -- I hope all is well and we look forward to this work

alamb avatar Mar 09 '24 09:03 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar May 09 '24 01:05 github-actions[bot]

Hello, sorry if this is a redundant question. What is the status of this PR?

therealsharath avatar May 13 '24 23:05 therealsharath

> Hello, sorry if this is a redundant question. What is the status of this PR?

I think it is stale and on track to be closed from what I can see

alamb avatar May 14 '24 19:05 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 14 '24 01:07 github-actions[bot]

FYI https://github.com/apache/datafusion/pull/11160 tracks a new proposal for this feature. It seems to be gaining traction

alamb avatar Jul 22 '24 11:07 alamb