datafusion
[draft] Add `LogicalType`, try to support user-defined types
Which issue does this PR close?
Closes #7923. Follows up on #8143, which is stale.
In its current state, this PR is a draft implementation meant to validate the idea discussed in #7421.
New additions
- `LogicalType` enum.
- `ExtensionType` trait. Abstraction for extension types.
- `TypeSignature` struct. Uniquely identifies a data type.
- `LogicalSchema` & `LogicalField`, equivalent to arrow's `Schema` and `Field`.
  - note: this was mostly a shortcut to be able to refactor somewhat easily without changing much of the logic of `DFSchema`. In later iterations `DFSchema` and `LogicalSchema` could potentially merge.
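To make the relationship between these additions easier to picture, here is a minimal, self-contained sketch. It is not the code in this PR: the variants, the method names, the `PhysicalType` stand-in for arrow's `DataType`, and the `Geometry` example type are all assumptions made for illustration.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical shape: identifies a data type by name plus parameters.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature {
    name: String,
    params: Vec<String>,
}

impl TypeSignature {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), params: Vec::new() }
    }

    pub fn with_params(name: &str, params: &[&str]) -> Self {
        Self {
            name: name.to_string(),
            params: params.iter().map(|p| p.to_string()).collect(),
        }
    }
}

/// Stand-in for arrow's `DataType`, so the sketch needs no dependencies.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PhysicalType {
    Int64,
    Utf8,
    Binary,
}

/// Hypothetical abstraction for user-defined (extension) types.
pub trait ExtensionType: Debug + Send + Sync {
    /// Signature that distinguishes this type from every other type.
    fn type_signature(&self) -> TypeSignature;
    /// Physical encoding used to store values of this type.
    fn physical_type(&self) -> PhysicalType;
}

/// Hypothetical logical type: built-in variants plus an extension slot.
#[derive(Debug, Clone)]
pub enum LogicalType {
    Int64,
    Utf8,
    Binary,
    Extension(Arc<dyn ExtensionType>),
}

impl LogicalType {
    pub fn type_signature(&self) -> TypeSignature {
        match self {
            LogicalType::Int64 => TypeSignature::new("Int64"),
            LogicalType::Utf8 => TypeSignature::new("Utf8"),
            LogicalType::Binary => TypeSignature::new("Binary"),
            LogicalType::Extension(ext) => ext.type_signature(),
        }
    }

    pub fn physical_type(&self) -> PhysicalType {
        match self {
            LogicalType::Int64 => PhysicalType::Int64,
            LogicalType::Utf8 => PhysicalType::Utf8,
            LogicalType::Binary => PhysicalType::Binary,
            LogicalType::Extension(ext) => ext.physical_type(),
        }
    }
}

/// Assumption: two logical types are equal when their signatures match.
impl PartialEq for LogicalType {
    fn eq(&self, other: &Self) -> bool {
        self.type_signature() == other.type_signature()
    }
}

/// Made-up extension type: geometry values stored as binary.
#[derive(Debug)]
pub struct Geometry {
    pub srid: u32,
}

impl ExtensionType for Geometry {
    fn type_signature(&self) -> TypeSignature {
        let srid = self.srid.to_string();
        TypeSignature::with_params("Geometry", &[srid.as_str()])
    }

    fn physical_type(&self) -> PhysicalType {
        PhysicalType::Binary
    }
}
```

In this sketch, `LogicalType::Extension(Arc::new(Geometry { srid: 4326 }))` compares equal to another `Geometry` with the same SRID and unequal to one with a different SRID, while both report `PhysicalType::Binary` as their storage.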
Major changes
- `DFSchema` uses `LogicalSchema` & `LogicalField`.
- `ExprSchemable` and `ExprSchema` now use `LogicalType`.
- `ast` to logical plan conversion now uses `LogicalType`.
To be implemented
- Registering extension types through a `ContextProvider`.
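Since registration is not implemented yet, the following is only a rough sketch of one possible shape, not a design decision. The `ExtensionTypeProvider` trait, the `get_extension_type` method, the `InMemoryTypeRegistry` struct, and the `Uuid` type are all hypothetical names; the real `ContextProvider` trait is not reproduced here.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the `ExtensionType` trait.
pub trait ExtensionType: Send + Sync {
    /// Name used to look the type up, e.g. when planning SQL.
    fn name(&self) -> &str;
}

/// Hypothetical slice of a context provider: only the lookup a planner needs.
pub trait ExtensionTypeProvider {
    fn get_extension_type(&self, name: &str) -> Option<Arc<dyn ExtensionType>>;
}

/// Simple in-memory registry keyed by type name.
#[derive(Default)]
pub struct InMemoryTypeRegistry {
    types: HashMap<String, Arc<dyn ExtensionType>>,
}

impl InMemoryTypeRegistry {
    /// Registers a type and returns the previously registered type
    /// with the same name, if there was one.
    pub fn register(
        &mut self,
        ty: Arc<dyn ExtensionType>,
    ) -> Option<Arc<dyn ExtensionType>> {
        self.types.insert(ty.name().to_string(), ty)
    }
}

impl ExtensionTypeProvider for InMemoryTypeRegistry {
    fn get_extension_type(&self, name: &str) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(name).cloned()
    }
}

/// Made-up extension type used only to exercise the registry.
pub struct Uuid;

impl ExtensionType for Uuid {
    fn name(&self) -> &str {
        "uuid"
    }
}
```

With this shape, a planner that encounters an unknown type name would call `get_extension_type` and fall back to an error when it returns `None`.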
To be determined
> [!NOTE]
> Most of these open questions remain similar to the initial PR.

- Should `ScalarValue` use `LogicalType` or arrow `DataType`?
  - [ ] `LogicalType`
  - [ ] `DataType`
- Should `TableSource` return `DFSchema` or arrow `Schema`?
  - [ ] `Schema`
  - [ ] `DFSchema`
- Conversion between physical types and logical types (in DataFusion, type conversion is achieved through the conversion of `DFSchema` to `Schema`; logical plans use `DFSchema`, physical plans use `Schema`).
  - Conversion between `Schema` and `DFSchema`
    - When to convert `Schema` to `DFSchema`?
      - [ ] During the construction of the logical `TableScan` node, obtain arrow `Schema` through `TableSource`/`TableProvider` and then convert it to `DFSchema`.
      - [ ] `TableSource`/`TableProvider` returns `DFSchema` instead of `Schema`.
    - When to convert `DFSchema` to `Schema`?
      - [ ] Directly obtain arrow `Schema` from `TableSource` in the physical planner, no need for conversion.
      - [ ] Convert the `DFSchema` returned by `TableSource` to `Schema` in the physical planner stage.
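To illustrate what the schema conversion questions above are about, here is a small dependency-free sketch of a round trip between a physical and a logical schema. Every type in it is a simplified stand-in (the real `Schema`, `Field`, and `DataType` live in arrow). The mapping rules, namely that dictionary-encoded and large string columns collapse to one logical string type, are assumptions for illustration rather than settled behaviour.

```rust
/// Stand-in for arrow's `DataType` (a physical encoding).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    LargeUtf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Simplified logical type: what the values mean, not how they are stored.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Int32,
    Utf8,
}

impl From<&DataType> for LogicalType {
    fn from(dt: &DataType) -> Self {
        match dt {
            DataType::Int32 => LogicalType::Int32,
            // Assumption: both string encodings share one logical type.
            DataType::Utf8 | DataType::LargeUtf8 => LogicalType::Utf8,
            // Assumption: a dictionary is logically its value type.
            DataType::Dictionary(_, value) => LogicalType::from(value.as_ref()),
        }
    }
}

impl LogicalType {
    /// Picks a default physical encoding. The original encoding
    /// (e.g. dictionary) is not recoverable from the logical type alone.
    pub fn default_physical_type(&self) -> DataType {
        match self {
            LogicalType::Int32 => DataType::Int32,
            LogicalType::Utf8 => DataType::Utf8,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LogicalField {
    pub name: String,
    pub logical_type: LogicalType,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LogicalSchema {
    pub fields: Vec<LogicalField>,
}

/// Physical to logical, e.g. when building a logical `TableScan`.
impl From<&Schema> for LogicalSchema {
    fn from(schema: &Schema) -> Self {
        let fields = schema
            .fields
            .iter()
            .map(|f| LogicalField {
                name: f.name.clone(),
                logical_type: LogicalType::from(&f.data_type),
            })
            .collect();
        LogicalSchema { fields }
    }
}

/// Logical to physical, e.g. in the physical planner.
impl From<&LogicalSchema> for Schema {
    fn from(schema: &LogicalSchema) -> Self {
        let fields = schema
            .fields
            .iter()
            .map(|f| Field {
                name: f.name.clone(),
                data_type: f.logical_type.default_physical_type(),
            })
            .collect();
        Schema { fields }
    }
}
```

The sketch also shows why the "when to convert" choice matters: converting a dictionary column to its logical type and back yields plain `Utf8`, so a physical planner that needs the source's actual encoding has to read the arrow `Schema` from the `TableSource` rather than rebuild it from the logical schema.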
I plan to check this out shortly -- thanks @notfilippo
@alamb I've added an example with some comments and TODOs remarking my open questions.
The biggest challenge of making this kind of change I think will be to manage the rollout and migration with downstream crates / make the transition as smooth as possible.
Completely agree. I will try to experiment with the user facing APIs (e.g. what's returned by the schema() method of structs implementing TableSource / TableProvider). I also think that most of the usage of DataType in physical and logical plans should be replaced with TypeRelation. I'll try to open a draft of what that would look like in comet in the coming days.
Thanks @notfilippo -- I will try and get the other projects I have under way to a better state so I can more fully help plan / communicate / coordinate this one.
> So I can more fully help plan / communicate / coordinate this one.
Sounds good! Feel free to follow up here / on slack / on discord 😄
I've also noticed that this potential change could greatly benefit the substrait encoding / decoding of the logical plan. Its current implementation has trouble dealing with dictionaries. I'll look into that as well while waiting for further instructions.
@alamb, @wjones127, @jayzhan211 -- I've found some time to finally draft a proposal: [Proposal] Decouple logical from physical types
(this draft PR was updated to DataFusion v40 in order to test datafusion-comet)
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.