velox
velox copied to clipboard
Enable partial date input support for from_iso8601_date()
Deploy Preview for meta-velox canceled.
Name | Link |
---|---|
Latest commit | dd2f22cf5f32e50bdd5a01cfe2f35929f57c1335 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/661892c923bd4800081b8820 |
@mbasmanova opened this PR to address your comment regarding partial date input handling. Turned out to be a simple fix, taken care of with a call to an alternate date parse function.
@mbasmanova Thanks for the review. I won't lie, I was initially thrown off by that name myself - I agree that it's a bit misleading.
Passing false
for isIso8601
is what enables non-strict parsing of partial dates. I can definitely expand on the documentation to make it more clear, perhaps renaming that parameter would be helpful too?
Passing false for isIso8601 is what enables non-strict parsing of partial dates. ..., perhaps renaming that parameter would be helpful too?
I would be nice to rename this parameter to reflect its meaning more accurately. Just to make sure, would you double check that supported formats are exactly the same as in Presto (or at least a subset, but not a superset)?
I can definitely expand on the documentation to make it more clear
That would be great.
@mbasmanova Your concern was valid - when testing the change, I noticed that along with allowing partial date input, the flag I enabled also allows ISO date strings to include the time component - this doesn't seem to be the case with Presto Java.
The ParseMode::kNonStandardCast
flag (definition below) which lets these values through is used elsewhere (particularly for Spark support), so modifying that wouldn't be a good idea - it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps). What do you say?
// kNonStandardCast: Like standard but permits missing day/month and allows
// trailing 'T' or spaces. Align with Spark SQL casting conventions.
@svm1 Thank you for taking a closer look.
it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps)
This sounds reasonable to me. CC: @rui-mo @PHILO-HE @pedroerp
@mbasmanova Now that I think about it, adding to the enum wouldn't quite be sufficient, due to the way the cast function is structured. The function itself needs to be modified or duplicated.
int32_t castFromDateString(const char* str, size_t len, bool strictParse) {
int64_t daysSinceEpoch;
size_t pos = 0;
auto mode =
strictParse ? ParseMode::kStandardCast : ParseMode::kNonStandardCast;
if (!tryParseDateString(str, len, pos, daysSinceEpoch, mode)) {
...
castFromDateString() is already being called with strictParse = false
elsewhere (SparkCastHooks) - so directing !strictParse to a different ParseMode would be just as problematic as modifying the existing kNonStandardCast mode itself.
To address this issue, as well as any similar needs that may arise in the future - wouldn't it be better if castFromDateString() takes a ParseMode parameter directly? Abstracting ParseMode away from the caller just seems to add unnecessary complications. Though this would require moving the enum to a header file that each caller can access.
wouldn't it be better if castFromDateString() takes a ParseMode parameter directly?
I think this is a good idea. An enum would make it easier to use the API because it will be clearer than 'true/false'.
CC: @rui-mo @PHILO-HE
CC @majetideepak
@mbasmanova I've refactored castFromDateString()
to use the enum directly. Updated docs and tests accordingly as well, please take a look.
@mbasmanova @svm1 Thanks for noticing. I created a test on partial input, and found the behavior was not aligned for input like 123
. Spark throws error or returns null for 123
, but 2020
works. While in Velox, 123
and 2020
both work. We can follow-up in a separate PR to fix Spark cast. What do you think?
scala> spark.sql("select cast('123' as date)").show(false)
+-----------------+
|CAST(123 AS DATE)|
+-----------------+
|null |
+-----------------+
scala> spark.sql("select cast('2020' as date)").show(false)
+------------------+
|CAST(2020 AS DATE)|
+------------------+
|2020-01-01 |
+------------------+
@mbasmanova Thanks for all your feedback. Cleaned up and elaborated upon the docs, please let me if it looks good to you!
Would you rebase if you haven't already?
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova merged this pull request in facebookincubator/velox@115a240c2b7b6766837037daa79276929e015cc7.
Conbench analyzed the 1 benchmark run on commit 115a240c
.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.