
[Feature Request] Support data type changes for schema evolution

Open frankyangdev opened this issue 2 years ago • 10 comments

Feature request

Overview

Currently, the following schema changes are eligible for schema evolution during table appends or overwrites:

  • Adding new columns
  • Changing a column's data type from NullType to any other type, or upcasting from ByteType -> ShortType -> IntegerType

Would it be possible to support more data type changes during append operations with schema evolution enabled?

Motivation

For history table updates, we want to keep every changed record as a newly appended record instead of overwriting the schema when a data type has changed at the source, with schema evolution enabled.

For example, when the scale of the decimal type is changed from 2 to 4 and the precision is kept unchanged:

//i.e. data type of one column is changed from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)

The append fails with the error: Failed to merge decimal types with incompatible scale 2 and 4
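In the meantime, the only workaround appears to be rewriting the whole table with the wider type before appending. A rough PySpark sketch (the column name "amount" is hypothetical, the full rewrite can be expensive, and the cast assumes existing values fit the new scale):

from pyspark.sql.functions import col

# Rewrite the table so the column is widened from decimal(38,2) to decimal(38,4)
existing = spark.read.format("delta").load(targetPath)
existing.withColumn("amount", col("amount").cast("decimal(38,4)")) \
    .write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(targetPath)

# After the rewrite, the append with the wider scale no longer hits the merge error
df.write.format("delta").mode("append").save(targetPath)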

Could this decimal scale change be supported in Delta schema evolution during table appends? It would also be worth reviewing whether other data type changes can be supported.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • [ ] Yes. I can contribute this feature independently.
  • [x] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • [ ] No. I cannot contribute this feature at this time.

frankyangdev avatar May 04 '22 07:05 frankyangdev

Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add a few more, thanks to @zsxwing:

  • Reviewing the Parquet Logical Type Definitions, it seems that decimals are stored as logical types backed by integer primitives, per Data types in Apache Parquet - thanks @bartosz25.
  • When reviewing Spark's supported data types, I was reminded that right now Delta Lake's conversion support is ByteType -> ShortType -> IntegerType; we currently don't support IntegerType -> LongType. This may be coming into play here, in that Spark uses java.math.BigDecimal to support the conversion while Parquet itself stores values using the various primitives.
  • To add to this, reviewing ParquetSchemaConverter.scala L589-L611 suggests this may not be a simple change: Decimal may use different underlying physical types depending on precision (see the sketch after this list), and when a column has different types in different Parquet files, there may be issues reading it back directly.
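For illustration only, a rough Python sketch of that precision-based mapping (the real logic lives in ParquetSchemaConverter.scala and also depends on writer settings such as spark.sql.parquet.writeLegacyFormat):

# Illustrative approximation of how the Parquet writer picks a physical type for DecimalType
def parquet_physical_type_for_decimal(precision: int) -> str:
    if precision <= 9:
        return "INT32"                  # small decimals fit in a 32-bit integer
    elif precision <= 18:
        return "INT64"                  # medium decimals fit in a 64-bit integer
    else:
        return "FIXED_LEN_BYTE_ARRAY"   # e.g. decimal(38, x) needs fixed-length binary

print(parquet_physical_type_for_decimal(38))  # FIXED_LEN_BYTE_ARRAY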

That said, please chime in if others have thoughts - perhaps I am missing something here or over-complicating things.

dennyglee avatar May 06 '22 01:05 dennyglee

Thanks @dennyglee for checking and for the explanation. I hope this decimal data type change can be supported in schema evolution in the future. Cheers

frankyangdev avatar May 06 '22 14:05 frankyangdev

We currently don’t support IntegerType -> LongType

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

It just so happens that this is the exact error that led me to find this issue:

# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType
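Casting the narrower column manually before appending does avoid the error, but it requires knowing the target schema up front. A sketch, assuming the int values fit in a long:

from pyspark.sql.functions import col

# Cast the int column up to long so it matches the table's existing bigint column
new_df.withColumn("id", col("id").cast("long")) \
    .write.format("delta").mode("append").option("mergeSchema", "true").save(path)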

What's interesting is that the following workaround actually works:

from delta.tables import DeltaTable
# the "false" condition never matches, so all rows from new_df are inserted
DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]

zmeir avatar May 17 '22 11:05 zmeir

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

That's not the same, unfortunately. ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64.
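A quick way to see this, sketched with PySpark plus pyarrow (paths are illustrative, and an active SparkSession named spark is assumed):

import glob
import pyarrow.parquet as pq
from pyspark.sql.types import StructType, StructField, ByteType, ShortType, IntegerType, LongType

schema = StructType([
    StructField("b", ByteType()),
    StructField("s", ShortType()),
    StructField("i", IntegerType()),
    StructField("l", LongType()),
])
spark.createDataFrame([(1, 1, 1, 1)], schema).write.mode("overwrite").parquet("/tmp/int_widths")

# Print the physical Parquet schema of one data file:
# b, s and i all appear as INT32 (with logical-type annotations), while l appears as INT64
part = glob.glob("/tmp/int_widths/part-*.parquet")[0]
print(pq.ParquetFile(part).schema)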

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()

This sounds like a bug to me. Could you open a new issue with a small reproduction?

zsxwing avatar May 18 '22 20:05 zsxwing

I don't understand these two comments, especially taken together:

We currently don’t support IntegerType -> LongType

ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64

Why exactly would conversion from int32 to int64 be unsupported - is that not a bug? Even conversion from int64 to int32 should not be a problem as long as the original value does not exceed int32 limits.

liquidaty avatar Sep 09 '22 17:09 liquidaty

Why exactly would conversion from int32 to int64 be unsupported - is that not a bug? Even conversion from int64 to int32 should not be a problem as long as the original value does not exceed int32 limits.

Supporting int32 to int64 would require a non-trivial change, because this is not supported by parquet and we would have to build a layer on top of parquet in order to support it. We consider this a feature request rather than a bug.

zsxwing avatar Sep 13 '22 18:09 zsxwing

Thanks for the update. I don't follow the logic ending with not-a-bug, but I'm not looking to argue that and appreciate the response.

This issue appears possibly related to https://github.com/databricks/databricks-sql-cli/issues/39, in which it seems it is not possible to import data from a CSV file into a bigint column.

  1. Does anyone know if that other issue can be solved without solving this issue?
  2. Is there any generally recommended utility or approach for simply importing a CSV file into a Spark cluster without running into this IntegerType -> LongType problem when loading into a bigint column?
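For example, would explicitly declaring the schema when reading the CSV be the right approach? A sketch (column names and paths are purely illustrative):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Declare the target types up front so the id column is read as bigint (LongType)
csv_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])
csv_df = spark.read.schema(csv_schema).option("header", "true").csv("/tmp/input.csv")
csv_df.write.format("delta").mode("append").save("/tmp/target_table")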

Perhaps these questions should be directed elsewhere; in any case any suggestions would be much appreciated!

liquidaty avatar Sep 13 '22 18:09 liquidaty

@liquidaty what you are describing is not this issue. This issue is about automatically changing a column's type, such as from int32 to int64; your issue is about writing an int32 value to a column whose type is int64. Could you create a separate issue to discuss it instead?

zsxwing avatar Sep 13 '22 18:09 zsxwing

@zsxwing thank you - OK, I will open a separate issue. It seemed related because the "Failed to merge incompatible data types LongType and IntegerType" error noted in this thread by @zmeir is the same one described in the databricks-sql-cli issue.

liquidaty avatar Sep 13 '22 19:09 liquidaty

@liquidaty actually, is the error you hit coming from COPY INTO? If so, you may need to talk to your Databricks support. That one is a data ingestion feature provided by Databricks Runtime and is not a part of the Delta Lake project. We are discussing building a similar feature inside Delta Lake (https://github.com/delta-io/delta/issues/1354), but that's a different story.

zsxwing avatar Sep 13 '22 19:09 zsxwing

@zsxwing I believe so - it's from a COPY INTO that is invoked using databricks-sql-cli, and I would assume the CLI is simply passing the COPY INTO on to Databricks, which in turn passes it on to Spark. Good to know about #1354, thank you

liquidaty avatar Sep 14 '22 19:09 liquidaty

@liquidaty Thanks for confirming it. The COPY INTO command is not a part of the Delta Lake project. Please talk to your Databricks support, as the Delta Lake project has nothing to do with that command.

zsxwing avatar Sep 14 '22 19:09 zsxwing