
[SUPPORT] Can't read a table with timestamp based partition key generator

Open · ofinchuk-bloomberg opened this issue on Feb 15, 2024

Can't read a table that was created using TimestampBasedKeyGenerator or CustomKeyGenerator for a timestamp partition. The issue is that `ts` remains a Long, while `_hoodie_partition_path` is formed as a String, so a simple read doesn't work and throws an exception.
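To illustrate the mismatch, here is a minimal sketch (not Hudi's actual code; the object name `PartitionPathSketch` is made up): the key generator formats the epoch-millis Long into a date String for the partition path, while the table schema keeps `ts` as a Long.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Illustration only: how a TimestampBasedKeyGenerator-style format turns the
// Long "ts" into a String partition path, given the repro's EPOCHMILLISECONDS
// timestamp type and "yyyyMMdd-HH" output format.
object PartitionPathSketch {
  def main(args: Array[String]): Unit = {
    val ts: Long = 1707954411089L // epoch millis, as stored in the "ts" column
    val fmt = DateTimeFormatter
      .ofPattern("yyyyMMdd-HH")
      .withZone(ZoneId.systemDefault()) // actual zone depends on key generator config
    val partitionPath: String = fmt.format(Instant.ofEpochMilli(ts))
    println(partitionPath) // e.g. "20240214-18" in the report
    // On read, Spark's partition parser tries to cast this String back to
    // LongType (the schema type of "ts"), which throws the exception below.
  }
}
```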

To Reproduce

Steps to reproduce the behavior:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkDemo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .appName("SparkByExample")
      .getOrCreate()

    import spark.implicits._

    spark.createDataset(List(
        ("id1", "name1", System.currentTimeMillis()),
        ("id2", "name2", System.currentTimeMillis() + 1)))
      .toDF("id", "name", "ts")
      .write
      .format("hudi")
      .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
      .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "name")
      .option("hoodie.table.name", "hudi_cow2")
      .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
      .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
      .mode(SaveMode.Overwrite)
      .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")

    // Reading the base parquet files directly works:
    spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
      .show()

    // Reading through the Hudi datasource throws the exception below:
    spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
      .show()
  }
}
```

When reading the parquet files directly, I see the following data:

```
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|            date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
|  20240214184652987|20240214184652987...|               id1|           20240214-18|9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
|  20240214184652987|20240214184652987...|               id2|           20240214-18|9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
```
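The mismatch is also visible in the base-file schema: `ts` stays a Long while `_hoodie_partition_path` is a String. A quick check, continuing the Spark session from the repro above:

```scala
// Inspect the base-file schema directly (same path as in the repro).
val raw = spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
raw.select("ts", "_hoodie_partition_path").printSchema()
// ts: long
// _hoodie_partition_path: string
```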

Expected behavior

The table should be read successfully into a Spark DataFrame.

Environment Description

  • Hudi version : 0.14.1 (hudi-spark3.3-bundle_2.12)

  • Spark version : 3.3.3 (local environment)

  • Running on Docker? (yes/no) : no

Stacktrace

```
Exception in thread "main" java.lang.RuntimeException: Failed to cast value '20240214-18' to 'LongType' for partition column 'ts'
	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
	at org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
	at org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
	at org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
	at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:364)
	at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$7(BaseHoodieTableFileIndex.java:333)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:336)
	at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:216)
	at org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:219)
	at org.apache.hudi.HoodieFileIndex.getFileSlicesForPrunedPartitions(HoodieFileIndex.scala:282)
	at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:211)
	at org.apache.hudi.HoodieFileIndex.listFiles(HoodieFileIndex.scala:151)
```

ofinchuk-bloomberg avatar Feb 15 '24 18:02 ofinchuk-bloomberg

@ofinchuk-bloomberg Thanks for raising this. I was able to reproduce it. I'm checking whether it is the same as another known issue, which was fixed by https://github.com/apache/hudi/pull/9863 but later reverted due to a small performance regression.

ad1happy2go avatar Feb 16 '24 10:02 ad1happy2go

@ofinchuk-bloomberg This is the same issue as https://issues.apache.org/jira/browse/HUDI-4818.

It is still reproducible. Will try to get this fix prioritized.

ad1happy2go avatar Feb 29 '24 08:02 ad1happy2go

@mansipp is this similar to what we saw when reading Hudi tables via Trino?

CTTY avatar Mar 02 '24 00:03 CTTY