[SUPPORT] Error using the property hoodie.datasource.write.drop.partition.columns
Hi. I am developing a process to ingest data from my HDFS using Hudi. I want to partition the data using a custom key generator class, where the partition key is declared as a tuple columnName@NumPartitions. My custom key generator then applies a modulo function to route each row to one partition or another.
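Roughly, the key generator looks like this (a simplified sketch: the class name and the hard-coded bucket count are illustrative, I extend SimpleKeyGenerator so the record-key handling comes for free, and a full implementation would also override the Row-based variants of these methods):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Sketch: route each record to a bucket computed from its primary key.
class CustomKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {
  // In my setup the bucket count is encoded in the partition-path field ("CID@12");
  // it is hard-coded here to keep the sketch short.
  private val numBuckets = 12

  override def getPartitionPath(record: GenericRecord): String = {
    val key = getRecordKey(record)                      // the raw CID value
    Math.floorMod(key.hashCode, numBuckets).toString    // partition path = hash(key) % numBuckets
  }
}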
The initial load is the following:
spark.read.option("mergeSchema", "true").parquet("PATH")
  .withColumn("_hoodie_is_deleted", lit(false))
  .write.format("hudi")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(CDC_ENABLED.key(), "true")
  .option(TABLE_NAME, tableName)
  .option("hoodie.datasource.write.payload.class", "CustomOverwriteWithLatestAvroPayload")
  .option("hoodie.avro.schema.validate", "false")
  .option("hoodie.datasource.write.recordkey.field", "CID")
  .option("hoodie.datasource.write.precombine.field", "sequential_total")
  .option("hoodie.datasource.write.new.columns.nullable", "true")
  .option("hoodie.datasource.write.reconcile.schema", "true")
  .option("hoodie.metadata.enable", "false")
  .option("hoodie.index.type", "SIMPLE")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "CID@12")
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .mode(Overwrite)
  .save("/tmp/hudi2")
I added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema. But with this property it does not work either. The error that appears is the following:
org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
at org.apache.hudi.BaseFileOnlyRelation.
hoodie.datasource.write.drop.partition.columns defaults to false; when it is set to true, the partition columns are not written into the data files. Either way, the partition field you declare here should be a field name that exists in the schema, not a value like CID@12.
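For illustration (assuming the partition column actually exists in the incoming DataFrame):

  // The partition path field must name a real column in the schema.
  .option("hoodie.datasource.write.partitionpath.field", "CID")
  // With drop.partition.columns=true the column is omitted from the data files
  // and should be reconstructed from the partition path on read.
  .option("hoodie.datasource.write.drop.partition.columns", "true")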
Also, is there any way to partition the data using a hash function of the row's primary key, to improve performance for row updates? I have developed a custom BuiltinKeyGenerator that overrides the method getPartitionPath (I take the partition path, which is the primary key, and apply the operation % numBuckets), but the problem is that when I read the data back, the value in the primary-key column is the result of that operation instead of the real value.
The contract here is: the partition field should be in the table schema anyway.
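One way to do hash bucketing while honoring that contract is to materialize the bucket as a real column before writing and partition on it. A sketch, assuming the column name bucket_id and the bucket count of 12 (both illustrative); the record key CID stays untouched in the data files:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val df = spark.read.option("mergeSchema", "true").parquet("PATH")
  .withColumn("_hoodie_is_deleted", lit(false))
  // Derive the partition column from the primary key: hash(CID) % 12.
  .withColumn("bucket_id", pmod(hash(col("CID")), lit(12)))

df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "CID")
  .option("hoodie.datasource.write.precombine.field", "sequential_total")
  // Partition on the derived column, which is part of the table schema.
  .option("hoodie.datasource.write.partitionpath.field", "bucket_id")
  // Optionally drop bucket_id from the data files; it can be rebuilt from the path.
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .option(TABLE_NAME, tableName)
  .mode(Overwrite)
  .save("/tmp/hudi2")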