On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector
On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector. All rows have a null value for the deleted column, but it is present in the output of `select *`. Also, on doing `describe table`, I see that the column is still present.
Your Environment
- hudi-bundle: org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 (used to create a Hudi table synced to the metastore)
- Spark version used: 3.3.2
- Presto version used: 0.281.1
- Storage (HDFS/S3/GCS..): GCS
- Data source and connector used: hive connector
- Deployment (Cloud or On-prem): GCP dataproc
- Pastebin link to the complete debug logs:
Expected Behavior
On deleting a Hudi column, I expected that the column would not be present when querying from Presto.
Current Behavior
All rows have a null value for the deleted column, but it is present in the output of `select *`. Also, on doing `describe table`, I see that the column is still present.
Possible Solution
TBD
Steps to Reproduce
- Create a Dataproc cluster and make the following changes:

  spark-defaults.conf:

  ```
  spark.serializer=org.apache.spark.serializer.KryoSerializer
  spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
  spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
  spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED
  spark.sql.legacy.avro.datetimeRebaseModeInRead=CORRECTED
  spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
  spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
  ```

  hive-site.xml:

  ```
  hive.metastore.disallow.incompatible.col.type.changes=false
  ```
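For anyone reproducing outside Dataproc, the same session settings can be passed when launching pyspark directly (a sketch; assumes the Hudi bundle is pulled via `--packages` and that Maven Central is reachable):

```shell
pyspark \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
```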
- Create a table from pyspark in Hudi:

  ```python
  from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
  from datetime import datetime

  schema = StructType([
      StructField("id", IntegerType(), True),
      StructField("name", StringType(), True),
      StructField("surname", StringType(), True),
      StructField("ts", TimestampType(), True)  # timestamp field
  ])
  data = [
      (1, "John", "Doe", datetime.now()),
      (2, "Jane", "Smith", datetime.now()),
      (3, "Michael", "Johnson", datetime.now()),
      (4, "Emily", "Williams", datetime.now())
  ]
  df = spark.createDataFrame(data, schema)

  (df.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "hoodie_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.keyprefix", "ts")
      .option("hoodie.schema.on.read.enable", "true")
      .mode("overwrite")
      .save("gs://xxxx/subham_test_metastore_13"))

  spark.sql("CREATE TABLE default.subham_test_metastore_13 USING hudi LOCATION 'gs://xxxx/subham_test_metastore_13'")
  ```
- From spark-sql (pyspark uses DataSource v1), enable schema on read and drop the column:

  ```sql
  set hoodie.schema.on.read.enable=true;
  ALTER TABLE default.subham_test_metastore_13 DROP COLUMN surname;
  ```
On doing this, we also get the following error message, but the column is still dropped from the Hudi data:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
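For context on why this can happen: Hudi's schema-on-read evolution changes the reader schema rather than rewriting data files, so old files still physically contain `surname` and it is up to each query engine's reader schema to drop it. A toy illustration of that projection (plain Python, no Hudi involved; `project` is a hypothetical helper, not a Hudi API):

```python
def project(record, schema_fields):
    """Project a stored record onto the reader's (evolved) schema.

    Fields dropped from the schema are simply not selected; fields
    added to the schema but absent from old files come back as None.
    """
    return {field: record.get(field) for field in schema_fields}

# A record as written before the DROP COLUMN, still on disk:
stored = {"id": 1, "name": "John", "surname": "Doe", "ts": "2024-01-01"}

# Reader schema after `ALTER TABLE ... DROP COLUMN surname`:
evolved = ["id", "name", "ts"]

print(project(stored, evolved))  # {'id': 1, 'name': 'John', 'ts': '2024-01-01'}
```

A reader that still uses the metastore's old column list (as Presto's Hive connector appears to here) would keep selecting `surname` and get nulls.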
- Update a row in the table from pyspark:

  ```python
  from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
  from datetime import datetime

  schema = StructType([
      StructField("id", IntegerType(), True),
      StructField("name", StringType(), True),
      StructField("ts", TimestampType(), True)  # timestamp field
  ])
  data = [
      (1, "Johny", datetime.now())
  ]
  df = spark.createDataFrame(data, schema)

  (df.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "hoodie_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.keyprefix", "ts")
      .option("hoodie.schema.on.read.enable", "true")
      .mode("append")
      .save("gs://xxxx/subham_test_metastore_13"))
  ```
- Now, from spark-sql and pyspark, we can see that the column no longer appears, but it still appears when querying from Presto.
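When triaging this, it can help to diff the column lists the two engines report, to confirm it is the Hive Metastore that still advertises the dropped column. A small sketch (a hypothetical helper, not part of Hudi or Presto; the column lists below mirror this repro):

```python
def stale_columns(engine_a_cols, engine_b_cols):
    """Return the columns engine B still reports that engine A no longer does.

    Intended use: feed it the column lists from DESCRIBE in spark-sql
    and in Presto after the drop.
    """
    a = set(engine_a_cols)
    return [c for c in engine_b_cols if c not in a]

spark_cols = ["id", "name", "ts"]              # spark-sql after DROP COLUMN
presto_cols = ["id", "name", "surname", "ts"]  # as reported by Presto

print(stale_columns(spark_cols, presto_cols))  # ['surname']
```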
Screenshots (if appropriate)
Context
We have a lakehouse in Hudi and use Presto with the Hive connector to query Hudi tables. We want to delete a column and are facing this problem.