[SUPPORT] Error using the property hoodie.datasource.write.drop.partition.columns
Hi. I am developing a process to ingest data from my HDFS using Hudi. I want to partition the data using a custom key generator class, where the partition key is declared as a tuple columnName@NumPartitions. My custom key generator then applies a modulo function to route each row to one partition or another.
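Roughly, the key generator looks like this (a simplified sketch: the class name and the hard-coded bucket count are illustrative, I extend SimpleKeyGenerator so the record-key handling comes for free, and a full implementation would also override the Row-based variants of these methods):

import org.apache.avro.generic.GenericRecord
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.keygen.SimpleKeyGenerator

// Sketch: route each record to a bucket computed from its primary key.
class CustomKeyGenerator(props: TypedProperties) extends SimpleKeyGenerator(props) {
  // In my setup the bucket count is encoded in the partition-path field ("CID@12");
  // it is hard-coded here to keep the sketch short.
  private val numBuckets = 12

  override def getPartitionPath(record: GenericRecord): String = {
    val key = getRecordKey(record)                      // the raw CID value
    Math.floorMod(key.hashCode, numBuckets).toString    // partition path = hash(key) % numBuckets
  }
}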
The initial load is the following:
spark.read.option("mergeSchema", "true").parquet("PATH")
  .withColumn("_hoodie_is_deleted", lit(false))
  .write.format("hudi")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(CDC_ENABLED.key(), "true")
  .option(TABLE_NAME, tableName)
  .option("hoodie.datasource.write.payload.class", "CustomOverwriteWithLatestAvroPayload")
  .option("hoodie.avro.schema.validate", "false")
  .option("hoodie.datasource.write.recordkey.field", "CID")
  .option("hoodie.datasource.write.precombine.field", "sequential_total")
  .option("hoodie.datasource.write.new.columns.nullable", "true")
  .option("hoodie.datasource.write.reconcile.schema", "true")
  .option("hoodie.metadata.enable", "false")
  .option("hoodie.index.type", "SIMPLE")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.keygenerator.class", "CustomKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "CID@12")
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .mode(Overwrite)
  .save("/tmp/hudi2")
I added the property hoodie.datasource.write.drop.partition.columns because when I read the final path, Hudi throws the error: Cannot find columns: 'CID@12' in the schema. But with this property it does not work either. The error that appears is the following:
org.apache.hudi.internal.schema.HoodieSchemaException: Failed to fetch schema from the table
at org.apache.hudi.HoodieBaseRelation.$anonfun$x$2$10(HoodieBaseRelation.scala:179)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:175)
at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt$lzycompute(HoodieBaseRelation.scala:151)
at org.apache.hudi.HoodieBaseRelation.internalSchemaOpt(HoodieBaseRelation.scala:151)
at org.apache.hudi.BaseFileOnlyRelation.
hoodie.datasource.write.drop.partition.columns defaults to false; when it is set to true, the partition columns are not written into the data files. Either way, the partition field you declare here should be a field name that exists in the schema, not a value like CID@12.
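For illustration (assuming the partition column actually exists in the incoming DataFrame):

  // The partition path field must name a real column in the schema.
  .option("hoodie.datasource.write.partitionpath.field", "CID")
  // With drop.partition.columns=true the column is omitted from the data files
  // and should be reconstructed from the partition path on read.
  .option("hoodie.datasource.write.drop.partition.columns", "true")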
Also, is there any way to partition the data using a hash function of the row's primary key, to improve performance for row updates? I have developed a custom BuiltinKeyGenerator that overrides the method getPartitionPath (I take the partition path, which is the primary key, and apply the operation % numBuckets), but the problem is that when I read the data back, the value in the primary-key column is the result of that operation instead of the real value.
The contract here is: the partition field should be in the table schema anyway.
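One way to do hash bucketing while honoring that contract is to materialize the bucket as a real column before writing and partition on it. A sketch, assuming the column name bucket_id and the bucket count of 12 (both illustrative); the record key CID stays untouched in the data files:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val df = spark.read.option("mergeSchema", "true").parquet("PATH")
  .withColumn("_hoodie_is_deleted", lit(false))
  // Derive the partition column from the primary key: hash(CID) % 12.
  .withColumn("bucket_id", pmod(hash(col("CID")), lit(12)))

df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "CID")
  .option("hoodie.datasource.write.precombine.field", "sequential_total")
  // Partition on the derived column, which is part of the table schema.
  .option("hoodie.datasource.write.partitionpath.field", "bucket_id")
  // Optionally drop bucket_id from the data files; it can be rebuilt from the path.
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .option(TABLE_NAME, tableName)
  .mode(Overwrite)
  .save("/tmp/hudi2")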