
On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector

Open sutodi opened this issue 9 months ago • 1 comments

After deleting a column from a Hudi table, the column is still present when querying from Presto using the Hive connector. All rows have a null value for the deleted column, but it still appears in the output of select *. Also, describe table still lists the column.

Your Environment

  • Hudi bundle: org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 (used to create a Hudi table synced to the metastore)
  • Spark version: 3.3.2

  • Presto version used: 0.281.1
  • Storage (HDFS/S3/GCS..): GCS
  • Data source and connector used: hive connector
  • Deployment (Cloud or On-prem): GCP dataproc
  • Pastebin link to the complete debug logs:

Expected Behavior

After deleting a column from the Hudi table, I expected that the column would no longer be present when querying from Presto.

Current Behavior

All rows have a null value for the deleted column, but it still appears in the output of select *. Also, describe table still lists the column.
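One way to pin down this symptom is to diff the column lists reported by the two engines. The sketch below is a minimal, engine-agnostic helper: `stale_columns` is a hypothetical name, and the two lists are typed in by hand from the schema in this report (in practice they would come from describe table in Spark and in Presto).

```python
def stale_columns(spark_columns, presto_columns):
    """Return columns Presto still exposes that Spark no longer reports.

    Comparison is case-insensitive; the order of the Presto list is kept.
    """
    current = {name.lower() for name in spark_columns}
    return [name for name in presto_columns if name.lower() not in current]


# Column lists taken from the schema in this report (assumed, typed by hand).
spark_columns = ["id", "name", "ts"]
presto_columns = ["id", "name", "surname", "ts"]

print(stale_columns(spark_columns, presto_columns))  # ['surname']
```

An empty result would mean both engines agree on the schema; any name returned is a column that only the Presto/Hive side still sees.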

Possible Solution


Steps to Reproduce

  1. Create a Dataproc cluster and make the following changes.

     spark-defaults.conf:

       spark.serializer=org.apache.spark.serializer.KryoSerializer
       spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
       spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
       spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED
       spark.sql.legacy.avro.datetimeRebaseModeInRead=CORRECTED
       spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
       spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar

     hive-site.xml:

       hive.metastore.disallow.incompatible.col.type.changes = false

  2. Create a table from pyspark in Hudi:

     from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
     from datetime import datetime

     schema = StructType([
         StructField("id", IntegerType(), True),
         StructField("name", StringType(), True),
         StructField("surname", StringType(), True),
         StructField("ts", TimestampType(), True)  # Adding timestamp field
     ])
     data = [
         (1, "John", "Doe",,
         (2, "Jane", "Smith",,
         (3, "Michael", "Johnson",,
         (4, "Emily", "Williams",
     ]

     df = spark.createDataFrame(data, schema)
     df.write
       .option("", "hoodie_table")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.keyprefix", "ts")
       .save("gs://xxxx/subham_test_metastore_13")

     spark.sql("CREATE TABLE default.subham_test_metastore_13 USING hudi LOCATION 'gs://xxxx/subham_test_metastore_13' ")

  3. Start a spark-sql session (pyspark uses datasource v1) and drop the column:

ALTER TABLE default.subham_test_metastore_11 DROP COLUMN surname;

This produces the following error message, but the data is still dropped from Hudi:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :

  4. Update a row in the table from pyspark:

     schema = StructType([
         StructField("id", IntegerType(), True),
         StructField("name", StringType(), True),
         StructField("ts", TimestampType(), True)  # Adding timestamp field
     ])
     data = [ (1,"Johny", ]
     df = spark.createDataFrame(data, schema)
     df.write
       .option("", "hoodie_table")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.keyprefix", "ts")

  5. Now, from spark-sql and pyspark, the column no longer appears, but it still appears when querying from Presto.
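The table-creation step above lost some details in the post (the timestamp values and one option key are blank), so the following is only a sketch of what a complete version might look like, not the exact code that was run. The timestamps are hypothetical. The option keys `hoodie.table.name` and `hoodie.datasource.write.precombine.field` are assumptions about what the blank key and `keyprefix` stood for, and `format("hudi")` and `mode("overwrite")` are likewise assumed. `write_hudi_table` is a hypothetical helper that expects an existing SparkSession with the Hudi bundle on the classpath.

```python
from datetime import datetime

# Column order from the schema in the report.
COLUMNS = ["id", "name", "surname", "ts"]

# Hypothetical timestamps: the original values are missing from the post.
ROWS = [
    (1, "John", "Doe", datetime(2024, 1, 1, 10, 0, 0)),
    (2, "Jane", "Smith", datetime(2024, 1, 1, 10, 5, 0)),
    (3, "Michael", "Johnson", datetime(2024, 1, 1, 10, 10, 0)),
    (4, "Emily", "Williams", datetime(2024, 1, 1, 10, 15, 0)),
]

# Assumed write options; only the record key option appears intact in the post.
HUDI_OPTIONS = {
    "hoodie.table.name": "hoodie_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}


def write_hudi_table(spark, path):
    """Write ROWS to `path` as a Hudi table using an existing SparkSession."""
    # Imported lazily so the data and options above can be inspected without Spark.
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType, TimestampType,
    )

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("surname", StringType(), True),
        StructField("ts", TimestampType(), True),
    ])
    df = spark.createDataFrame(ROWS, schema)
    df.write.format("hudi").options(**HUDI_OPTIONS).mode("overwrite").save(path)
```

Calling `write_hudi_table(spark, "gs://xxxx/subham_test_metastore_13")` from a pyspark shell would then correspond to the write in the report, under the assumptions listed above.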

Screenshots (if appropriate)


We have a lakehouse in Hudi and use Presto with the Hive connector to query Hudi tables. We want to delete a column and are running into this problem.

sutodi, May 09 '24 08:05