Hudi Delta Streamer unable to read Older Dates
Hudi Delta Streamer is unable to read dates older than 1900-01-01. The workaround is to set the following Spark configurations:
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED
These options work fine when I create a Hudi table with PySpark. However, when I run the CDC process with DeltaStreamer I still get this error. Please note that I cannot use a Hudi transformer class, because a transformer can only be applied after DeltaStreamer has read the source files, and DeltaStreamer is not able to read them.
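For reference, here is a minimal sketch of how the four settings above can be kept in one place and turned into `--conf` arguments. The names `REBASE_CONFS` and `to_spark_submit_args` are hypothetical helpers written for this illustration and are not part of Hudi or Spark. The sketch only shows how the options are assembled. It does not show that DeltaStreamer honors them, which is the open question in this issue.

```python
# The four rebase settings quoted in this issue, all set to CORRECTED.
# REBASE_CONFS and to_spark_submit_args are hypothetical names used only here.
REBASE_CONFS = {
    "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
}


def to_spark_submit_args(confs):
    """Flatten a dict of Spark settings into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in confs.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments as they would appear on a spark-submit command line.
    print(" ".join(to_spark_submit_args(REBASE_CONFS)))
```

In a PySpark job the same dictionary could be applied to the session by calling `SparkSession.builder.config(key, value)` for each entry.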
The error message is "An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is."
@alexeykudinkin one more issue here related to dates older than 1900-01-01
@alexeykudinkin based on the discussion in Slack, is there a solution to it?
CC @rmahindra123 gentle ping @alexeykudinkin
@nsivabalan let's merge this one w/ https://github.com/apache/hudi/issues/6278
I've put up https://github.com/apache/hudi/pull/6352 to address this, but didn't hear back from the original reporter(s) whether they were able to try it out and if it resolved the issue.
@alexeykudinkin @nsivabalan can you please let me know what I should try in order to see if the issue is resolved? I have responded in ticket #6278.
@brskiran1: can you try out https://github.com/apache/hudi/pull/6352 and let us know if the issue is resolved?
Since the patch has landed, I'm going ahead and closing this issue. Feel free to open a new issue if you need further assistance.