[SUPPORT] Can't read a table with timestamp based partition key generator
Can't read back a table that was created using TimestampBasedKeyGenerator or CustomKeyGenerator with a timestamp partition. The issue is that `ts` remains Long type in the table schema, while `_hoodie_partition_path` is formed as a String, so a simple read operation doesn't work and throws an exception.
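In essence the failure is this type mismatch (an illustrative sketch only, not the actual Hudi code path):

```scala
// The key generator writes the partition directory as a formatted string,
// while the table schema still declares the partition column `ts` as Long.
val partitionDir = "20240214-18" // produced by the "yyyyMMdd-HH" output format

// Spark's file index then tries to cast this path segment back to LongType,
// which boils down to something like this and fails with NumberFormatException:
val parsed = partitionDir.toLong
```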
To Reproduce
Steps to reproduce the behavior:
```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .appName("SparkByExample")
      .getOrCreate()

    import spark.implicits._

    // Write two records, partitioned by the epoch-millis `ts` column
    // formatted as "yyyyMMdd-HH" via CustomKeyGenerator.
    spark.createDataset(List(
        ("id1", "name1", System.currentTimeMillis()),
        ("id2", "name2", System.currentTimeMillis() + 1)))
      .toDF("id", "name", "ts")
      .write
      .format("hudi")
      .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
      .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "name")
      .option("hoodie.table.name", "hudi_cow2")
      .option("hoodie.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS")
      .option("hoodie.keygen.timebased.output.dateformat", "yyyyMMdd-HH")
      .mode(SaveMode.Overwrite)
      .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")

    // Reading the raw parquet files directly works.
    spark.read.parquet("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/2*")
      .show()

    // Reading through the Hudi datasource throws the exception below.
    spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2/")
      .show()
  }
}
```

When reading the parquet files directly, I see the following data:
```
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id| name|           ts|            date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
|  20240214184652987|20240214184652987...|               id1|           20240214-18|9d4eb7eb-847a-4e1...|id1|name1|1707954411089|2024-02-14 15:00|
|  20240214184652987|20240214184652987...|               id2|           20240214-18|9d4eb7eb-847a-4e1...|id2|name2|1707954411090|2024-02-14 15:01|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-----+-------------+----------------+
```
Expected behavior
The table should be read successfully into a Spark DataFrame.
Environment Description
I use Spark 3.3.3 and hudi-spark3.3-bundle_2.12:0.14.1 in a local environment.
- Running on Docker? (yes/no): no
Stacktrace
```
Exception in thread "main" java.lang.RuntimeException: Failed to cast value '20240214-18' to 'LongType' for partition column 'ts'
at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$3(Spark3ParsePartitionUtil.scala:78)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:71)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil$.parsePartition(Spark3ParsePartitionUtil.scala:69)
at org.apache.hudi.HoodieSparkUtils$.parsePartitionPath(HoodieSparkUtils.scala:280)
at org.apache.hudi.HoodieSparkUtils$.parsePartitionColumnValues(HoodieSparkUtils.scala:264)
at org.apache.hudi.SparkHoodieTableFileIndex.doParsePartitionColumnValues(SparkHoodieTableFileIndex.scala:401)
at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:364)
at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$7(BaseHoodieTableFileIndex.java:333)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:336)
at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:216)
at org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:219)
at org.apache.hudi.HoodieFileIndex.getFileSlicesForPrunedPartitions(HoodieFileIndex.scala:282)
at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:211)
at org.apache.hudi.HoodieFileIndex.listFiles(HoodieFileIndex.scala:151)
```
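Until this is fixed, one possible workaround is to pre-format the epoch-millis timestamp into a string column and partition on that with SimpleKeyGenerator, so the partition column's schema type (String) matches the partition path. A minimal sketch, assuming the same Spark session and data as the repro above; `ts_hour` is a column name introduced here for illustration:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.from_unixtime
import spark.implicits._

val df = spark.createDataset(List(
    ("id1", "name1", System.currentTimeMillis()),
    ("id2", "name2", System.currentTimeMillis() + 1)))
  .toDF("id", "name", "ts")

// Format the epoch-millis value up front (from_unixtime expects seconds and
// uses the session time zone), then partition on the resulting String column.
df.withColumn("ts_hour", from_unixtime((df("ts") / 1000).cast("long"), "yyyyMMdd-HH"))
  .write
  .format("hudi")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "ts_hour")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "name")
  .option("hoodie.table.name", "hudi_cow2")
  .mode(SaveMode.Overwrite)
  .save("/Users/ofinchuk/tools/workspace/hudi/hudi_cow2")
```

This sidesteps the key generator's path formatting entirely, at the cost of carrying an extra string column in the table.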
@ofinchuk-bloomberg Thanks for raising this. I was able to reproduce it. Checking whether it is the same as another known issue that was fixed by https://github.com/apache/hudi/pull/9863, a fix that was later reverted due to a small performance regression.
@ofinchuk-bloomberg This is the same issue as https://issues.apache.org/jira/browse/HUDI-4818. It is still reproducible; will try to get this fix prioritized.
@mansipp is this similar to what we saw when reading Hudi tables via Trino?