[Feature Request] Support data type changes for schema evolution
Feature request
Overview
Currently, the following schema changes are eligible for schema evolution during table appends or overwrites:
- Adding new columns
- Changing data types from NullType -> any other type, or upcasting ByteType -> ShortType -> IntegerType
Is it possible to support more data type changes during append operations with schema evolution enabled?
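For reference, a minimal sketch of the append-time evolution that already works today, i.e. adding a new column (the path is illustrative and assumes a Spark session with Delta Lake configured):
from pyspark.sql import functions as F
demo_path = "/tmp/schema_evolution_demo"  # illustrative path
spark.range(3).write.format("delta").save(demo_path)
# Appending a DataFrame with an extra column succeeds once mergeSchema is enabled;
# the new column is added to the table schema and existing rows read back as null for it.
spark.range(3).withColumn("label", F.lit("new")) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(demo_path)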
Motivation
For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type changes at the source and schema evolution is enabled.
For example, when the scale of the decimal type changes from 2 to 4 while the precision stays the same:
# i.e. the data type of one column changes from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)
The append fails with the error: Failed to merge decimal types with incompatible scale 2 and 4
Can this decimal scale change be supported by Delta schema evolution during table appends? Please also consider other data type changes.
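A minimal reproduction sketch of the decimal case above (path and values are illustrative; assumes a Spark session with Delta Lake configured):
from decimal import Decimal
from pyspark.sql import functions as F
targetPath = "/tmp/decimal_evolution_demo"  # illustrative path
# The existing table column is decimal(38,2)
spark.range(1).select(F.lit(Decimal("1.23")).cast("decimal(38,2)").alias("amount")) \
    .write.format("delta").save(targetPath)
# The source now produces decimal(38,4); this append fails with
# "Failed to merge decimal types with incompatible scale 2 and 4"
spark.range(1).select(F.lit(Decimal("1.2345")).cast("decimal(38,4)").alias("amount")) \
    .write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)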
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [x] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [ ] No. I cannot contribute this feature at this time.
Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional points, thanks to @zsxwing:
- Reviewing the Parquet Logical Type Definitions, it seems that decimals are stored as logical data types backed by integers per Data types in Apache Parquet - thanks @bartosz25
- When reviewing Spark supported data types I was reminded that right now Delta Lake conversion support is ByteType -> ShortType -> IntegerType. We currently don't support IntegerType -> LongType. This may be coming into play here, in that Spark is using java.math.BigDecimal to support the conversion but Parquet itself is storing things using the various primitives.
- To add to it, when reviewing ParquetSchemaConverter.scala L589-L611, this may not be a simple change. Decimal may use different underlying types. When a column has different types in different parquet files, there may be issues reading it back directly (see the sketch after this comment).
Saying this, if others can chime in: perhaps I'm missing something here or over-complicating things?
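To illustrate the point about decimals using different underlying types, here is a sketch (illustrative path; it writes plain parquet with Spark and uses pyarrow, an extra dependency, only to print the physical schema). With Spark's default parquet writer, decimals of precision <= 9 are backed by int32, precision <= 18 by int64, and larger precisions by fixed_len_byte_array:
from decimal import Decimal
import glob
from pyspark.sql import functions as F
import pyarrow.parquet as pq
demo_path = "/tmp/decimal_physical_types_demo"  # illustrative path
spark.range(1).select(
    F.lit(Decimal("1.23")).cast("decimal(9,2)").alias("small"),
    F.lit(Decimal("1.23")).cast("decimal(18,2)").alias("medium"),
    F.lit(Decimal("1.23")).cast("decimal(38,2)").alias("large"),
).write.mode("overwrite").parquet(demo_path)
# Print the physical parquet schema; with default settings "small" is backed by int32,
# "medium" by int64, and "large" by fixed_len_byte_array.
for f in glob.glob(demo_path + "/*.parquet"):
    print(pq.ParquetFile(f).schema)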
Thanks @dennyglee for checking and for the explanation. I hope this decimal data type change can be supported by schema evolution in the future. Cheers
We currently don’t support IntegerType -> LongType
Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.
It just so happens that this is the exact error that led me to find this issue:
# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType
What's interesting is that the following workaround actually works:
DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]
Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.
That's not the same, unfortunately. ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64.
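To make the asymmetry concrete, a sketch contrasting the two cases (illustrative paths, assuming a Spark session with Delta Lake configured; the byte -> int widening follows the upcast rules listed at the top of this issue):
from pyspark.sql import functions as F
# Byte -> int stays within parquet's int32 physical type, so the merge is allowed.
spark.range(1).select(F.col("id").cast("byte").alias("v")) \
    .write.format("delta").save("/tmp/upcast_ok")
spark.range(1).select(F.col("id").cast("int").alias("v")) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/upcast_ok")
# Int -> long crosses the int32/int64 physical boundary and fails with a
# "Failed to merge incompatible data types" error.
spark.range(1).select(F.col("id").cast("int").alias("v")) \
    .write.format("delta").save("/tmp/upcast_fails")
spark.range(1).select(F.col("id").cast("long").alias("v")) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/upcast_fails")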
DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
This sounds like a bug to me. Could you open a new issue with a small reproduction?
I don't understand these two comments, especially taken together:
We currently don’t support IntegerType -> LongType
ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64
Why exactly would conversion from int32 to int64 be unsupported -- is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.
Why exactly would conversion from int32 to int64 be unsupported -- is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.
Supporting int32 to int64 will require a non-trivial change because this is not supported by parquet, and we would have to build a layer on top of parquet in order to support it. We consider this a feature request rather than a bug.
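For completeness, the manual route available today is to rewrite the table with the wider type (a sketch only, assuming the whole table can be rewritten and that path points at the Delta table from the earlier example):
from pyspark.sql import functions as F
# Read the table, cast the column to the wider type, and overwrite both data and schema.
spark.read.format("delta").load(path) \
    .withColumn("id", F.col("id").cast("long")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(path)
New data with a long id column can then be appended without mergeSchema conflicts.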
Thanks for the update. I don't follow the logic ending with not-a-bug, but I'm not looking to argue that and appreciate the response.
This issue appears possibly related to: https://github.com/databricks/databricks-sql-cli/issues/39
in which it seems not possible to import data from a CSV file into a bigint column.
- Does anyone know if that other issue can be solved without solving this issue?
- Is there any generally recommended utility or approach to simply import a CSV file into a Spark cluster without running into this IntegerType -> LongType problem when loading into a bigint column? (See the sketch after this comment.)
Perhaps these questions should be directed elsewhere; in any case any suggestions would be much appreciated!
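For the second question, one common approach (a sketch with illustrative column names and paths, not an official recommendation) is to declare the CSV schema explicitly so the column is already bigint/LongType before it reaches the Delta table:
# spark.read.schema() accepts a DDL string; declaring id as BIGINT avoids the
# IntegerType -> LongType mismatch when appending to a bigint column.
df = (spark.read
      .option("header", "true")
      .schema("id BIGINT, name STRING")  # illustrative columns
      .csv("/path/to/file.csv"))  # illustrative path
df.write.format("delta").mode("append").save("/path/to/delta/table")  # illustrative path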
@liquidaty what you are talking about is not this issue. This issue is for automatically changing the column type such as from int32 to int64. Your issue is writing an int32 value to a column whose type is int64. Could you create a separate issue to discuss instead?
@zsxwing thank you-- OK, will open a separate issue. It seemed related because the "Failed to merge incompatible data types LongType and IntegerType" error noted in this thread by @zmeir is the same one described in the databricks-sql-cli issue.
@liquidaty actually, is the error you hit coming from COPY INTO? If so, you may need to talk to your Databricks support. That one is a data ingestion feature provided by Databricks Runtime. It's not part of the Delta Lake project. We are discussing building a similar feature inside Delta Lake ( https://github.com/delta-io/delta/issues/1354 ), but that's a different story.
@zsxwing I believe so-- it's from a COPY INTO that is invoked using databricks-sql-cli, and I would assume that the cli is simply passing the COPY INTO on to databricks which in turn is passing it on to Spark. Good to know about #1354, thank you
@liquidaty Thanks for confirming it. The COPY INTO command is not a part of the Delta Lake project. Please talk to your Databricks support, as the Delta Lake project has nothing to do with that.