datafusion
[draft] Add `LogicalType`, try to support user-defined types
Which issue does this PR close?
Closes #7923.
This pull request is an experimental demo for validating the feasibility of logical types.
Rationale for this change
What changes are included in this PR?
Features
- Create User-Defined Types (UDTs) through SQL, specifying the field types as UDTs during table creation.
- Support the use of `UDT` as a function signature in `udf/udaf`.
- Register extension types through the `register_data_type` function in the `SessionContext`.
New Additions
- `LogicalType` struct.
- `ExtensionType` trait. Abstraction for extension types.
- `TypeSignature` struct. Uniquely identifies a data type.
Major Changes
- Added `get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType>` function to the `ContextProvider` trait.
- In `DFSchema`, `DFField` now uses `LogicalType`, removing arrow `Field` and retaining only `data_type`, `nullable`, `metadata` since `dict_id`, `dict_is_ordered` are not necessary at the logical stage.
- `ExprSchemable` and `ExprSchema` now use `LogicalType`.
- `ast` to logical plan conversion now uses `LogicalType`.
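To make the pieces above concrete, here is a minimal, self-contained sketch of how a type registry could tie `TypeSignature`, `ExtensionType`, `register_data_type`, and `get_data_type` together. It is not the PR's code: the `TypeRegistry` struct, the enum shape of `LogicalType`, the trait methods, and the choice to store `LineString` as text are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's `TypeSignature`: a name plus optional
/// parameters that together identify a data type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TypeSignature {
    pub name: String,
    pub params: Vec<String>,
}

impl TypeSignature {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), params: Vec::new() }
    }
}

/// Simplified logical type (assumed shape): two built-ins and an extension variant.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Int64,
    Utf8,
    /// A user-defined type: its signature and the type used to store it.
    Extension { signature: TypeSignature, physical: Box<LogicalType> },
}

/// Assumed shape of the extension-type abstraction.
pub trait ExtensionType {
    fn type_signature(&self) -> TypeSignature;
    fn physical_type(&self) -> LogicalType;
}

/// Hypothetical registry playing the role of `SessionContext` / `ContextProvider`.
#[derive(Default)]
pub struct TypeRegistry {
    types: HashMap<TypeSignature, LogicalType>,
}

impl TypeRegistry {
    /// Registers an extension type; returns the previous entry for the same signature, if any.
    pub fn register_data_type(&mut self, ext: &dyn ExtensionType) -> Option<LogicalType> {
        let signature = ext.type_signature();
        let logical = LogicalType::Extension {
            signature: signature.clone(),
            physical: Box::new(ext.physical_type()),
        };
        self.types.insert(signature, logical)
    }

    /// Looks up a logical type by signature, mirroring the `get_data_type` lookup.
    pub fn get_data_type(&self, name: &TypeSignature) -> Option<LogicalType> {
        self.types.get(name).cloned()
    }
}

/// Example extension type, stored as text purely for illustration.
pub struct LineString;

impl ExtensionType for LineString {
    fn type_signature(&self) -> TypeSignature {
        TypeSignature::new("line_string")
    }
    fn physical_type(&self) -> LogicalType {
        LogicalType::Utf8
    }
}
```

In this sketch the planner would resolve a type name found in SQL by building a `TypeSignature` and calling `get_data_type`; a `None` result means the type was never registered.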
To Be Implemented
- `TypeCoercionRewriter` in the analyze stage uses logical types. For example, functions like `comparison_coercion`, `get_input_types`, `get_valid_types`, etc.
- Function signatures for `udf/udaf` use `TypeSignature` instead of the existing `DataType` for ease of use in `udf/udaf`.
To Be Determined
- Should `ScalarValue` use `LogicalType` or arrow `DataType`?
  - [ ] `LogicalType`.
  - [ ] `DataType`
- Should `TableSource` return `DFSchema` or arrow `Schema`?
  - [ ] `Schema`.
  - [ ] `DFSchema`
- Conversion between physical types and logical types (in DataFusion, type conversion is achieved through the conversion of `DFSchema` to `Schema`; logical plans use `DFSchema`, physical plans use `Schema`).
- Conversion between `Schema` and `DFSchema`
  - When to convert `Schema` to `DFSchema`?
    - [ ] During the construction of the logical `TableScan` node, obtain arrow `Schema` through `TableSource/TableProvider` and then convert it to `DFSchema`.
    - [ ] `TableSource/TableProvider` returns `DFSchema` instead of `Schema`.
  - When to convert `DFSchema` to `Schema`?
    - [ ] Directly obtain arrow `Schema` from `TableSource` in physical planner, no need for conversion.
    - [ ] Convert the `DFSchema` returned by `TableSource` to `Schema` in the physical planner stage.
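The logical-to-physical direction discussed above can be sketched as a field-by-field mapping. The following is a minimal illustration under stated assumptions, not DataFusion's or arrow's actual API: `LogicalField` stands in for `DFField`, `PhysicalField` for an arrow `Field`, and `PhysicalType`, `physical_type`, and `to_physical_schema` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum PhysicalType {
    Int64,
    Utf8,
}

/// Simplified logical type (assumed shape); an extension names its storage type.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Int64,
    Utf8,
    Extension { name: String, storage: PhysicalType },
}

impl LogicalType {
    /// Maps a logical type to the physical type that stores it.
    pub fn physical_type(&self) -> PhysicalType {
        match self {
            LogicalType::Int64 => PhysicalType::Int64,
            LogicalType::Utf8 => PhysicalType::Utf8,
            LogicalType::Extension { storage, .. } => storage.clone(),
        }
    }
}

/// Stand-in for `DFField`: keeps only `data_type`, `nullable`, and `metadata`.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalField {
    pub name: String,
    pub data_type: LogicalType,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Stand-in for an arrow `Field`, as used by physical plans.
#[derive(Debug, Clone, PartialEq)]
pub struct PhysicalField {
    pub name: String,
    pub data_type: PhysicalType,
    pub nullable: bool,
    pub metadata: HashMap<String, String>,
}

/// Converts a logical schema to a physical one, field by field.
pub fn to_physical_schema(fields: &[LogicalField]) -> Vec<PhysicalField> {
    fields
        .iter()
        .map(|f| PhysicalField {
            name: f.name.clone(),
            data_type: f.data_type.physical_type(),
            nullable: f.nullable,
            metadata: f.metadata.clone(),
        })
        .collect()
}
```

Note that this mapping loses information: two logical types with the same storage type become indistinguishable. That asymmetry is one reason the reverse direction (`Schema` to `DFSchema`) is the harder question in the list above.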
Some Thoughts
- In this comment, the use case of converting from `dyn Array` to `LineStringArray` or `MultiPointArray` was raised. From my perspective, assuming there is a function specifically designed for handling `LineString` data, the function signature can be defined as `LineString`, ensuring that the input data must be of a type acceptable by `LineStringArray`.
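The idea of declaring a function signature in terms of a logical type can be sketched as a check at planning time. This is an illustrative sketch only: `FunctionSignature`, its `check` method, and the `line_length` function name are hypothetical and are not part of DataFusion.

```rust
/// Simplified logical types for this example (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LogicalType {
    Utf8,
    LineString,
    MultiPoint,
}

/// Hypothetical function signature expressed in logical types.
pub struct FunctionSignature {
    pub name: &'static str,
    pub args: Vec<LogicalType>,
}

impl FunctionSignature {
    /// Validates argument types during planning, before any array is touched.
    pub fn check(&self, actual: &[LogicalType]) -> Result<(), String> {
        if actual.len() != self.args.len() {
            return Err(format!(
                "{}: expected {} argument(s), got {}",
                self.name,
                self.args.len(),
                actual.len()
            ));
        }
        for (i, (expected, got)) in self.args.iter().zip(actual).enumerate() {
            if expected != got {
                return Err(format!(
                    "{}: argument {} expects {:?}, got {:?}",
                    self.name, i, expected, got
                ));
            }
        }
        Ok(())
    }
}
```

If the check passes for a function declared with a `LineString` argument, the implementation can assume its input is of a type acceptable to `LineStringArray` and downcast from `dyn Array` accordingly.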
Are these changes tested?
Are there any user-facing changes?
This PR has some unresolved issues that need discussion. Once the team reaches consensus on all of them, I will reorganize the PR accordingly.
I've organized the logic for the mutual conversion between `DFSchema` and `Schema` in datafusion. In theory, there should be no conversion logic from `Schema` to `DFSchema`. I've outlined all the modifications below.
DFSchema to Schema
No need to change
DefaultPhysicalPlanner
- DescribeTable
- Values -> ValuesExec
- EmptyRelation -> EmptyExec
- Unnest -> UnnestExec
- CopyTo
- Explain
- Analyze
To be changed
- [ ] TableProvider::schema
  - [ ] ViewTable
  - [ ] ListingTable
  - [ ] EmptyTable
  - [ ] MemTable
  - [ ] StreamingTable
- [ ] DataFrame
  - [x] write_table: replace with DFSchema
  - [ ] cache: build MemTable
Schema to DFSchema (To be changed)
- [x] LogicalPlanBuilder::insert_into: can directly use DFSchema
- [x] LogicalPlanBuilder::explain: can directly use DFSchema
- [x] ConstEvaluator: construct DFSchema then to Schema
- [x] SqlToRel::explain_to_plan: output schema can directly use DFSchema
- [x] SqlToRel::describe_table_to_plan: output schema can directly use DFSchema
- [ ] SqlToRel::insert_to_plan: depends on `table_source.schema()`
- [ ] SqlToRel::delete_to_plan: depends on `table_source.schema()`
- [ ] ListingTable::scan: used to create_physical_expr
Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow
What's the status of this pr? This should be a very useful feature.
I think this PR is stalled and I don't have any update
Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now proceed to resume work on this PR.
No worries at all -- I hope all is well and we look forward to this work
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Hello, sorry if this is a redundant question. What is the status of this PR?
I think it is stale and on track to be closed from what I can see
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
FYI https://github.com/apache/datafusion/pull/11160 tracks a new proposal for this feature. It seems to be gaining traction