velox icon indicating copy to clipboard operation
velox copied to clipboard

Enable partial date input support for from_iso8601_date()

Open svm1 opened this issue 10 months ago • 4 comments

svm1 avatar Apr 03 '24 18:04 svm1

Deploy Preview for meta-velox canceled.

Name Link
Latest commit dd2f22cf5f32e50bdd5a01cfe2f35929f57c1335
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/661892c923bd4800081b8820

netlify[bot] avatar Apr 03 '24 18:04 netlify[bot]

@mbasmanova opened this PR to address your comment regarding partial date input handling. Turned out to be a simple fix, taken care of with a call to an alternate date parse function.

svm1 avatar Apr 03 '24 18:04 svm1

@mbasmanova Thanks for the review. I won't lie, I was initially thrown off by that name myself - I agree that it's a bit misleading.

Passing false for isIso8601 is what enables non-strict parsing of partial dates. I can definitely expand on the documentation to make it more clear, perhaps renaming that parameter would be helpful too?

svm1 avatar Apr 03 '24 18:04 svm1

Passing false for isIso8601 is what enables non-strict parsing of partial dates. ..., perhaps renaming that parameter would be helpful too?

I would be nice to rename this parameter to reflect its meaning more accurately. Just to make sure, would you double check that supported formats are exactly the same as in Presto (or at least a subset, but not a superset)?

I can definitely expand on the documentation to make it more clear

That would be great.

mbasmanova avatar Apr 03 '24 18:04 mbasmanova

@mbasmanova Your concern was valid - when testing the change, I noticed that along with allowing partial date input, the flag I enabled also allows ISO date strings to include the time component - this doesn't seem to be the case with Presto Java.

The ParseMode::kNonStandardCast flag (definition below) which lets these values through is used elsewhere (particularly for Spark support), so modifying that wouldn't be a good idea - it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps). What do you say?

// kNonStandardCast: Like standard but permits missing day/month and allows
// trailing 'T' or spaces. Align with Spark SQL casting conventions.

svm1 avatar Apr 09 '24 23:04 svm1

@svm1 Thank you for taking a closer look.

it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps)

This sounds reasonable to me. CC: @rui-mo @PHILO-HE @pedroerp

mbasmanova avatar Apr 10 '24 11:04 mbasmanova

@mbasmanova Now that I think about it, adding to the enum wouldn't quite be sufficient, due to the way the cast function is structured. The function itself needs to be modified or duplicated.

int32_t castFromDateString(const char* str, size_t len, bool strictParse) {
  int64_t daysSinceEpoch;
  size_t pos = 0;

  auto mode =
      strictParse ? ParseMode::kStandardCast : ParseMode::kNonStandardCast;
  if (!tryParseDateString(str, len, pos, daysSinceEpoch, mode)) {
      ...

castFromDateString() is already being called with strictParse = false elsewhere (SparkCastHooks) - so directing !strictParse to a different ParseMode would be just as problematic as modifying the existing kNonStandardCast mode itself.

To address this issue, as well as any similar needs that may arise in the future - wouldn't it be better if castFromDateString() takes a ParseMode parameter directly? Abstracting ParseMode away from the caller just seems to add unnecessary complications. Though this would require moving the enum to a header file that each caller can access.

svm1 avatar Apr 10 '24 19:04 svm1

wouldn't it be better if castFromDateString() takes a ParseMode parameter directly?

I think this is a good idea. An enum would make it easier to use the API because it will be clearer than 'true/false'.

CC: @rui-mo @PHILO-HE

mbasmanova avatar Apr 10 '24 19:04 mbasmanova

CC @majetideepak

svm1 avatar Apr 10 '24 20:04 svm1

@mbasmanova I've refactored castFromDateString() to use the enum directly. Updated docs and tests accordingly as well, please take a look.

svm1 avatar Apr 11 '24 01:04 svm1

@mbasmanova @svm1 Thanks for noticing. I created a test on partial input, and found the behavior was not aligned for input like 123. Spark throws error or returns null for 123, but 2020 works. While in Velox, 123 and 2020 both work. We can follow-up in a separate PR to fix Spark cast. What do you think?

scala> spark.sql("select cast('123' as date)").show(false)
+-----------------+                                                             
|CAST(123 AS DATE)|
+-----------------+
|null             |
+-----------------+


scala> spark.sql("select cast('2020' as date)").show(false)
+------------------+                                                            
|CAST(2020 AS DATE)|
+------------------+
|2020-01-01        |
+------------------+

rui-mo avatar Apr 11 '24 01:04 rui-mo

@mbasmanova Thanks for all your feedback. Cleaned up and elaborated upon the docs, please let me if it looks good to you!

svm1 avatar Apr 11 '24 22:04 svm1

Would you rebase if you haven't already?

mbasmanova avatar Apr 11 '24 23:04 mbasmanova

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 11 '24 23:04 facebook-github-bot

@mbasmanova merged this pull request in facebookincubator/velox@115a240c2b7b6766837037daa79276929e015cc7.

facebook-github-bot avatar Apr 13 '24 00:04 facebook-github-bot

Conbench analyzed the 1 benchmark run on commit 115a240c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

conbench-facebook[bot] avatar Apr 13 '24 01:04 conbench-facebook[bot]