spark
spark copied to clipboard
[SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
What changes were proposed in this pull request?
This PR changes the behavior of how columns with mixing dates and timestamps are supported in CSV schema inference and data parsing.
- If user provides a
timestampFormat/timestampNTZFormat, this type of columns will always be inferred asStringType. - If no
timestampFormat/timestampNTZFormatspecified by user, this type of columns could be inferred asTimestampTypewhen the date values can be parsed by the lenient timestampFormatter- otherwise
StringType
Here are the semantics of the changes:
-
In
CSVInferSchema- Remove the attempts to infer field as
DateTypewhenprefersDate=trueandtypeSoFar=TimestampType/TimestampNTZType - Change the dataType merging behavior between
DateTypeandTimestampType/TimestampNTZType:- If the
timestampFormat/timestampNTZFormatis given, merge the two types into StringType - Otherwise
- if the
dateFormatcould be parsed by the lenient timestamp formatter, merge the two types intoTimestampType/TimestampNTZType - otherwise, merge the two types into
StringType
- if the
- If the
- Remove the attempts to infer field as
-
In
UnivocityParser, remove the attempts to parse field as Date if it failed to be parsed as Timestamp.
The PR does not change the way we support date type in CSV data source, it is still controlled by prefersDate flag for backwards compatibility.
Why are the changes needed?
Simplify CSV dateTime inference logic.
Does this PR introduce any user-facing change?
No compared to the previous PR.
How was this patch tested?
Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.
Is it correct that the date inference is still controlled by "prefersDate"?
Sure! Yes, it's still controlled by "prefersDate".
Let me re-review the change to use ISO8601 parsing only.
There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.
- pure date + no datetime pattern: infer as date type
- pure timestamp + no datetime pattern: infer as timestamp type
- mixture + no datetime pattern: infer as timestamp type
- pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
- pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
- mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If
prefersDateis true, only try to infer as date type, otherwise only try to infer as timestamp.
I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.
I am just not sure about hardcoding date formats in the CSVInferSchema parser.
There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.
- pure date + no datetime pattern: infer as date type
- pure timestamp + no datetime pattern: infer as timestamp type
- mixture + no datetime pattern: infer as timestamp type
- pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
- pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
- mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If
prefersDateis true, only try to infer as date type, otherwise only try to infer as timestamp.
Thanks @cloud-fan. Totally agree with those behaviors, and this PR is exactly making that happen.
I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.
I am just not sure about hardcoding date formats in the CSVInferSchema parser.
Thanks @sadikovi.
The logic of this PR you described is not accurate. The actual logic is that
- if user specifies timestamp format, columns of mixed dates and timestamps will be inferred as String type.
- If user does not specify any timestamp format, columns with mixed dates and timestamps could be inferred as timestamp if possible, regardless of if prefersDate is set to true or not.
Understand your concern, it's a trade-off of not introducing performance degradation.
Can one of the admins verify this patch?
@cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR, seems that all of the questions have been addressed. Thanks.
Can we update PR description to mention that prefersDate is by default true now?
thanks, merging to master!