spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

Open xiaonanyang-db opened this issue 3 years ago • 7 comments

What changes were proposed in this pull request?

This PR changes the behavior of how columns with mixing dates and timestamps are supported in CSV schema inference and data parsing.

  • If user provides a timestampFormat/timestampNTZFormat, this type of columns will always be inferred as StringType.
  • If no timestampFormat/timestampNTZFormat specified by user, this type of columns could be inferred as
    • TimestampType when the date values can be parsed by the lenient timestampFormatter
    • otherwise StringType

Here are the semantics of the changes:

  • In CSVInferSchema

    • Remove the attempts to infer field as DateType when prefersDate=true and typeSoFar=TimestampType/TimestampNTZType
    • Change the dataType merging behavior between DateType and TimestampType/TimestampNTZType:
      • If the timestampFormat/timestampNTZFormat is given, merge the two types into StringType
      • Otherwise
        • if the dateFormat could be parsed by the lenient timestamp formatter, merge the two types into TimestampType/TimestampNTZType
        • otherwise, merge the two types into StringType
  • In UnivocityParser, remove the attempts to parse field as Date if it failed to be parsed as Timestamp.

The PR does not change the way we support date type in CSV data source, it is still controlled by prefersDate flag for backwards compatibility.

Why are the changes needed?

Simplify CSV dateTime inference logic.

Does this PR introduce any user-facing change?

No compared to the previous PR.

How was this patch tested?

xiaonanyang-db avatar Sep 19 '22 17:09 xiaonanyang-db

Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead.

Is it correct that the date inference is still controlled by "prefersDate"?

Sure! Yes, it's still controlled by "prefersDate".

xiaonanyang-db avatar Sep 20 '22 05:09 xiaonanyang-db

Let me re-review the change to use ISO8601 parsing only.

sadikovi avatar Sep 20 '22 06:09 sadikovi

There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.

  1. pure date + no datetime pattern: infer as date type
  2. pure timestamp + no datetime pattern: infer as timestamp type
  3. mixture + no datetime pattern: infer as timestamp type
  4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
  5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
  6. mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If prefersDate is true, only try to infer as date type, otherwise only try to infer as timestamp.

cloud-fan avatar Sep 21 '22 07:09 cloud-fan

I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.

I am just not sure about hardcoding date formats in the CSVInferSchema parser.

sadikovi avatar Sep 21 '22 07:09 sadikovi

There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not.

  1. pure date + no datetime pattern: infer as date type
  2. pure timestamp + no datetime pattern: infer as timestamp type
  3. mixture + no datetime pattern: infer as timestamp type
  4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type
  5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type
  6. mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If prefersDate is true, only try to infer as date type, otherwise only try to infer as timestamp.

Thanks @cloud-fan. Totally agree with those behaviors, and this PR is exactly making that happen.

xiaonanyang-db avatar Sep 21 '22 07:09 xiaonanyang-db

I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible.

I am just not sure about hardcoding date formats in the CSVInferSchema parser.

Thanks @sadikovi.

The logic of this PR you described is not accurate. The actual logic is that

  • if user specifies timestamp format, columns of mixed dates and timestamps will be inferred as String type.
  • If user does not specify any timestamp format, columns with mixed dates and timestamps could be inferred as timestamp if possible, regardless of if prefersDate is set to true or not.

Understand your concern, it's a trade-off of not introducing performance degradation.

xiaonanyang-db avatar Sep 21 '22 07:09 xiaonanyang-db

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 21 '22 14:09 AmplabJenkins

@cloud-fan @HyukjinKwon Do you have any concerns or questions? IMHO, we can merge this PR, seems that all of the questions have been addressed. Thanks.

sadikovi avatar Sep 21 '22 22:09 sadikovi

Can we update PR description to mention that prefersDate is by default true now?

cloud-fan avatar Sep 22 '22 13:09 cloud-fan

thanks, merging to master!

cloud-fan avatar Sep 23 '22 07:09 cloud-fan