spark-rapids
[SPARK-39469][SQL] Infer date type for CSV schema inference
Is your feature request related to a problem? Please describe.
Add a new `inferDate` option to CSV Options. The description is:

Whether or not to infer columns that satisfy the `dateFormat` option as `Date`. Requires `inferSchema` to be true. When `false`, columns with dates will be inferred as `String` (or as `Timestamp` if it fits the `timestampFormat`). Legacy date formats in `Timestamp` columns cannot be parsed with this option.

The `inferDate` option should prevent performance degradation for users who don't opt in.
`inferField` in `CSVInferSchema.scala` is modified to include the Date type.
If `typeSoFar` in `inferField` is Date, Timestamp or TimestampNTZ, we first attempt to parse Date and then parse Timestamp/TimestampNTZ. We attempt to parse Date even when `typeSoFar` is Timestamp/TimestampNTZ to handle the case where a column contains a timestamp entry followed by a date entry: both data types should be detected and the column inferred as a timestamp type.
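The parse order described above can be illustrated with a small standalone model. This is only a sketch, not the code in `CSVInferSchema.scala`: the type tags, the helper `_parses`, the function name `infer_field`, and the use of Python `strptime` directives in place of Spark datetime patterns are all assumptions made for illustration, and the other inferred types (integer, double, boolean, TimestampNTZ) are left out.

```python
from datetime import datetime

# Hypothetical type tags for this sketch; Spark's DataType objects are not used.
DATE, TIMESTAMP, STRING = "date", "timestamp", "string"


def _parses(value, fmt):
    """Return True if value matches the strptime format exactly."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_field(type_so_far, value, infer_date=True,
                date_format="%Y-%m-%d",
                timestamp_format="%Y-%m-%dT%H:%M:%S"):
    """Return the column type after seeing one more entry.

    type_so_far is None before any entry has been seen.
    """
    if type_so_far == STRING:
        # Once a column has fallen back to string it stays a string.
        return STRING
    # Date is attempted first, and only when the user opted in.
    if infer_date and _parses(value, date_format):
        # A date entry after a timestamp entry keeps the column a timestamp.
        return TIMESTAMP if type_so_far == TIMESTAMP else DATE
    if _parses(value, timestamp_format):
        # A timestamp entry after date entries widens the column to timestamp.
        return TIMESTAMP
    return STRING
```

With `infer_date=False` the date attempt is skipped entirely, so a date-only entry falls through to the timestamp check and then to string, which mirrors the opt-in behavior described for the option.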
Summary of the new behavior:
The new behavior of schema inference when `inferDate = true`:
- If a column contains only dates, it should be of “date” type in the inferred schema
  - If the date format and the timestamp format are identical (e.g. both are yyyy/mm/dd), entries will default to being interpreted as Date
- If a column contains dates and timestamps, it should be of “timestamp” type in the inferred schema
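
The summary rules can be checked against a column-level model that classifies each entry independently and then widens to the most general type seen. This is a hedged sketch under stated assumptions: `classify`, `infer_column`, and the widening order are hypothetical names and choices for illustration, and the formats are Python `strptime` directives rather than Spark datetime patterns.

```python
from datetime import datetime

# Widening order assumed for this sketch: date < timestamp < string.
_RANK = {"date": 0, "timestamp": 1, "string": 2}


def classify(value, date_format, timestamp_format):
    """Classify a single entry; the date format is tried before the timestamp format."""
    for name, fmt in (("date", date_format), ("timestamp", timestamp_format)):
        try:
            datetime.strptime(value, fmt)
            return name
        except ValueError:
            continue
    return "string"


def infer_column(values, date_format="%Y/%m/%d",
                 timestamp_format="%Y/%m/%d %H:%M:%S"):
    """Return the widest type among the entries (string for an empty column)."""
    types = [classify(v, date_format, timestamp_format) for v in values]
    return max(types, key=_RANK.__getitem__, default="string")
```

Because the date format is tried first, passing the same pattern for both formats yields "date", matching the tie-breaking rule in the first bullet.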
Additional context
https://github.com/apache/spark/commit/c2536a7eab
Followup: https://github.com/apache/spark/commit/31ab8bc4d5