
[SPARK-39469][SQL] Infer date type for CSV schema inference

Open amahussein opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

Add a new inferDate option to CSV Options. The description is:

Whether or not to infer columns that satisfy the dateFormat option as Date. Requires inferSchema to be true. When false, columns with dates will be inferred as String (or as Timestamp if they fit the timestampFormat). Legacy date formats in Timestamp columns cannot be parsed with this option. Because inferDate is opt-in, users who do not enable it should see no performance degradation.

inferField in CSVInferSchema.scala is modified to include the Date type.

If typeSoFar in inferField is Date, Timestamp, or TimestampNTZ, we first attempt to parse the entry as Date and then as Timestamp/TimestampNTZ. We still attempt Date when typeSoFar is Timestamp/TimestampNTZ to handle a column that contains a timestamp entry followed by a date entry: both data types should be detected, and the column should be inferred as a timestamp type.

Summary of the new behavior:

The new behavior of schema inference when inferDate = true:

  1. If a column contains only dates, it should be of “date” type in the inferred schema. If the date format and the timestamp format are identical (e.g. both are yyyy/MM/dd), entries default to being interpreted as Date
  2. If a column contains dates and timestamps, it should be of “timestamp” type in the inferred schema
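
The two rules above can be illustrated with a small standalone sketch. It is not the Spark implementation in CSVInferSchema.scala: the function names (`infer_field`, `infer_column`), the string type labels, and the use of Python `strptime` patterns in place of Spark datetime patterns are all assumptions made for illustration, and other types (integers, TimestampNTZ, and so on) are ignored.

```python
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical stand-ins for the dateFormat and timestampFormat options,
# written as strptime patterns rather than Spark datetime patterns.
DATE_FORMAT = "%Y-%m-%d"
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def _matches(value: str, fmt: str) -> bool:
    """Return True if the whole value parses under fmt."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_field(type_so_far: Optional[str], value: str,
                date_fmt: str = DATE_FORMAT,
                ts_fmt: str = TIMESTAMP_FORMAT) -> str:
    """Merge one entry into the type inferred so far for a column."""
    if type_so_far == "string":
        # Once a column has fallen back to string, it stays string.
        return "string"
    # Date is tried first, even when the column is already a timestamp.
    if _matches(value, date_fmt):
        # A date seen after a timestamp keeps the column as timestamp.
        return "timestamp" if type_so_far == "timestamp" else "date"
    if _matches(value, ts_fmt):
        return "timestamp"
    return "string"


def infer_column(values: Iterable[str],
                 date_fmt: str = DATE_FORMAT,
                 ts_fmt: str = TIMESTAMP_FORMAT) -> Optional[str]:
    """Fold infer_field over every entry; None means no entries were seen."""
    type_so_far = None
    for value in values:
        type_so_far = infer_field(type_so_far, value, date_fmt, ts_fmt)
    return type_so_far
```

Under these assumptions, `infer_column(["2022-07-25", "2022-07-26"])` yields a date column, while mixing in `"2022-07-25T15:07:00"` in either order yields a timestamp column. Passing the same pattern for both `date_fmt` and `ts_fmt` yields a date column, because Date is attempted first.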

Additional context

https://github.com/apache/spark/commit/c2536a7eab

Followup: https://github.com/apache/spark/commit/31ab8bc4d5

amahussein · Jul 25 '22 15:07