Hudi Delta Streamer unable to read Older Dates

Open SubashRanganathan opened this issue 2 years ago • 2 comments

Hudi Delta Streamer is unable to read dates older than 1900-01-01. The suggested workaround is to set the following Spark configurations:

spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED

These options work fine when I create a Hudi table with PySpark. However, when I run the CDC process with DeltaStreamer, I still get this error. Please note that I cannot use the Hudi transformer class, because a transformer is applied only after Delta Streamer has read the source files, and Delta Streamer is not able to read the source files.
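For anyone reproducing this, here is a minimal sketch of how the four settings quoted above expand into repeated `--conf key=value` arguments. The helper names (`to_conf_args`, `render_command`) and the bare `spark-submit` prefix are hypothetical and only for illustration; the real DeltaStreamer launch command, jar, and class arguments are not shown in this issue and are left out. As reported above, passing these settings did not resolve the error for DeltaStreamer.

```python
import shlex

# Rebase-mode settings quoted in this issue.
REBASE_CONFS = {
    "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "CORRECTED",
}


def to_conf_args(confs):
    """Expand a {key: value} mapping into repeated ['--conf', 'key=value'] pairs."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


def render_command(confs, prefix=("spark-submit",)):
    """Return a shell-quoted command line.

    `prefix` is a placeholder (assumption): replace it with the actual launcher
    and append the job's own jar/class arguments after the --conf pairs.
    """
    return shlex.join([*prefix, *to_conf_args(confs)])


if __name__ == "__main__":
    print(render_command(REBASE_CONFS))
```

In a PySpark session, the same key/value pairs can be applied with `SparkSession.builder.config(key, value)` before the session is created.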

The error message is "An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is."

SubashRanganathan avatar Aug 04 '22 17:08 SubashRanganathan

@alexeykudinkin one more issue here related to dates older than 1900-01-01

brskiran1 avatar Aug 06 '22 04:08 brskiran1

@alexeykudinkin based on the discussion in Slack, is there a solution to it?

yihua avatar Aug 08 '22 05:08 yihua

CC @rmahindra123 gentle ping @alexeykudinkin

nsivabalan avatar Aug 16 '22 04:08 nsivabalan

@nsivabalan let's merge this one w/ https://github.com/apache/hudi/issues/6278

I've put up https://github.com/apache/hudi/pull/6352 to address this, but didn't hear back from the original reporter(s) whether they were able to try it out and if it resolved the issue.

alexeykudinkin avatar Aug 16 '22 17:08 alexeykudinkin

@alexeykudinkin @nsivabalan can you please let me know what I should try in order to check whether the issue is resolved? I have responded in ticket #6278.

brskiran1 avatar Aug 16 '22 17:08 brskiran1

@brskiran1: can you try out https://github.com/apache/hudi/pull/6352 and let us know if the issue is resolved?

nsivabalan avatar Aug 27 '22 20:08 nsivabalan

Since the patch has landed, going ahead and closing the issue out. Feel free to open a new issue if you need further assistance.

nsivabalan avatar Nov 02 '22 07:11 nsivabalan