datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[draft] Add `LogicalType`, try to support user-defined types

Open notfilippo opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Closes #7923 Follows up on #8143, which is stale.

In the current state the PR is a draft implementation to validate the idea based on the discussion from #7421.

New additions

  • LogicalType enum.
  • ExtensionType trait. Abstraction for extension types.
  • TypeSignature struct. Uniquely identifies a data type.
  • LogicalSchema & LogicalField, equivalent to arrow's Schema and Field
    • note: this was mostly a shortcut to be able to refactor somewhat easily without changing much of the logic of of DFSchema. In next iterations DFSchema and LogicalSchema could potentially merge.

Major changes

  • DFSchema uses LogicalSchema & LogicalField.
  • ExprSchemable and ExprSchema now use LogicalType.
  • ast to logical plan conversion now uses LogicalType.

To be implemented

  • Registering extension types through a ContextProvider.

To be determined

[!NOTE] Most of these open questions remain similar to the initial PR.

  • Should ScalarValue use LogicalType or arrow DataType?
    • [ ] LogicalType
    • [ ] DataType
  • Should TableSource return DFSchema or arrow Schema?
    • [ ] Schema
    • [ ] DFSchema
  • Conversion between physical types and logical types (in Datafusion, type conversion is achieved through the conversion of DFSchema to Schema; logical plans use DFSchema, physical plans use Schema).
  • Conversion between Schema and DFSchema
    • When to convert Schema to DFSchema?
      • [ ] During the construction of the logical TableScan node, obtain arrow Schema through TableSource/TableProvider and then convert it to DFSchema.
      • [ ] TableSource/TableProvider returns DFSchema instead of Schema.
    • When to convert DFSchema to Schema?
      • [ ] Directly obtain arrow Schema from TableSource in physical planner, no need for conversion.
      • [ ] Convert the DFSchema returned by TableSource to Schema in the physical planner stage.

notfilippo avatar Jun 28 '24 09:06 notfilippo

I plan to check this out shortly -- thanks @notfilippo

alamb avatar Jun 30 '24 12:06 alamb

@alamb I've added an example with some comments and TODOs remarking my open questions.

notfilippo avatar Jul 02 '24 14:07 notfilippo

The biggest challenge of making this kind of change I think will be to manage the rollout and migration with downstream crates / make the transition as smooth as possible.

Completely agree. I will try to experiment with the user facing APIs (e.g. what's returned by the schema() method of structs implementing TableSource / TableProvider). I also think that most of the usage of DataType in physical and logical plans should be replaced with TypeRelation. I'll try to open a draft of what that would look like in comet in the coming days.

notfilippo avatar Jul 03 '24 14:07 notfilippo

Thanks @notfilippo -- I will try and get the other projects I have under way to a better state so I can more fully help plan / communicate / coordinate this one.

alamb avatar Jul 05 '24 15:07 alamb

So I can more fully help plan / communicate / coordinate this one.

Sounds good! Feel free to follow up here / on slack / on discord 😄

I've also noticed that this potential change could greatly benefit the substrait encoding / decoding of the logical plan. Its current implementation has troubles dealing with dictionaries. I'll look into that as well while waiting for further instructions.

notfilippo avatar Jul 05 '24 19:07 notfilippo

@alamb, @wjones127, @jayzhan211 -- I've found some time to finally draft a proposal: [Proposal] Decouple logical from physical types

(this draft PR was updated to DataFusion v40 in order to test datafusion-comet)

notfilippo avatar Jul 17 '24 12:07 notfilippo

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Sep 17 '24 01:09 github-actions[bot]