velox
velox copied to clipboard
Velox doesn't support legacy date format behavior in Spark SQL
Bug description
For backward capability of parsing/formatting of timestamp/date strings expression behavior and adoption of new behaviors. Spark add a setting spark.sql.legacy.timeParserPolicy to indicate whether using legacy behavior. If it is set to LEGACY, Spark doesn't performs strict checking for expression's input. I see Velox doesn't support this when using unix_timestamp function, there is a related gluten issue : link. Maybe there are more result mismatch cases.
Maybe we should add support for legacy date format behavior. @pedroerp @rui-mo What do you think? Thanks!
@NEUpanning, is Spark's legacy policy widely used in your production environment?
@PHILO-HE Yes. In Spark upgrades, we need to use legacy policy for backward compatibility rather than pushing users to accept the new behavior. Here is part of legacy policies we are using :
spark.sql.storeAssignmentPolicy Legacy
spark.sql.legacy.createArrayOneBasedIndex false
spark.sql.legacy.setCommandRejectsSparkCoreConfs false
spark.sql.legacy.ctePrecedencePolicy LEGACY
spark.sql.legacy.createEmptyCollectionUsingStringType true
spark.sql.legacy.exponentLiteralAsDecimal.enabled true
spark.sql.legacy.parser.havingWithoutGroupByAsWhere true
spark.sql.legacy.timeParserPolicy LEGACY
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue true
spark.sql.legacy.fromDayTimeString.enabled true
spark.sql.legacy.allowHashOnMapType true
spark.sql.legacy.allowNegativeScaleOfDecimal true
spark.sql.legacy.followThreeValuedLogicInArrayExists false
spark.sql.legacy.json.allowEmptyString.enabled true
spark.sql.legacy.setopsPrecedence.enabled true
spark.sql.legacy.literal.pickMinimumPrecision false
spark.sql.legacy.sizeOfNull true
spark.sql.legacy.typeCoercion.datetimeToString.enabled true
spark.sql.legacy.doLooseUpcast true
spark.sql.legacy.addMonthsAdjustLastDay true
spark.sql.legacy.statisticalAggregate true
@NEUpanning: we have recently added support for different date/timestamp parsing semantics; are these not enough to follow the Spark parsing semantic you described? If not, could you highlight some of the differences?
https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.h#L48
Cc: @mbasmanova
@pedroerp Thanks for your reply. UnixTimestampParseFunction is using DateTimeFormatter to parse date : source code link. DateTimeFormatter doesn't support different date/timestamp parsing semantics like TimestampConversion.
@kecookier
@NEUpanning DateTimeFormatter has different parsing "types". My point was that, even if none of these types support the exact parsing semantics you need today, they could be extended to do so:
https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/DateTimeFormatter.h#L26
@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?
Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?
Cc: @mbasmanova who recently enhanced this part of the code.
@pedroerp Here is an example I wrote from scratch.
code :
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;import java.util.Date;
public class Test {
static SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
static DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
public static void invalidMonth() {
// Invalid month (13)
String dateString = "2023-13-01";
// SimpleDateFormat can parse the invalid date
try {
Date date = simpleDateFormat.parse(dateString);
System.out.println("SimpleDateFormat parsed date: " + date);
} catch (ParseException e) {
System.out.println("SimpleDateFormat: " + e.getMessage());
}
// Joda-Time will throw exception
try {
DateTime dateTime = jodaFormatter.parseDateTime(dateString);
System.out.println("Joda-Time parsed date: " + dateTime);
} catch (IllegalArgumentException e) {
System.out.println("Joda-Time: " + e.getMessage());
}
}
public static void dateWithExtraCharacters() {
// Date with extra Characters
String dateString = "2023-12-01 00:00:00";
// SimpleDateFormat does not use all characters up to the end of the string
try {
Date date = simpleDateFormat.parse(dateString);
System.out.println("SimpleDateFormat parsed date: " + date);
} catch (ParseException e) {
System.out.println("SimpleDateFormat: " + e.getMessage());
}
// Joda-Time consumes all characters
DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
try {
DateTime dateTime = jodaFormatter.parseDateTime(dateString);
System.out.println("Joda-Time parsed date: " + dateTime);
} catch (IllegalArgumentException e) {
System.out.println("Joda-Time: " + e.getMessage());
}
}
public static void main(String[] args) throws ParseException {
invalidMonth();
dateWithExtraCharacters();
}
}
output :
SimpleDateFormat parsed date: Mon Jan 01 00:00:00 CST 2024
Joda-Time: Cannot parse "2023-13-01": Value 13 for monthOfYear must be in the range [1,12]
SimpleDateFormat parsed date: Fri Dec 01 00:00:00 CST 2023
Joda-Time: Invalid format: "2023-12-01 00:00:00" is malformed at " 00:00:00"
@pedroerp I'm implementing new DateTimeFormatterType and i see DateTimeFormatSpecifier doesn't support Week in month pattern used by java.text.SimpleDateFormat. Should i implement it in another PR or in the PR that addresses this issue?
Should i implement it in another PR
@NEUpanning A separate PR would be nice. Smaller PRs are easier to review. Thanks.
Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?
Cc: @mbasmanova who recently enhanced this part of the code.
Here is the difference between joda and SimpleDateFormat:
| Field (SimpleDateFormat) | Field (Joda) | Desc | Velox Support |
|---|---|---|---|
| C | century of era (>=0) | ||
| F | Day of week in month | Unsupported | |
| L | M | Month in year (standalone form) | |
| M | Month in year (context sensitive) | Unsupported | |
| W | Week in month | https://github.com/facebookincubator/velox/pull/10511 | |
| Y | year of era (>=0) | ||
| Y | x | Week year | Unsupported |
| Z | Z | Time zone | Unsupported |
| u | e | day of week |
@pedroerp I'd like to add a new
DateTimeFormatterTypecalledSIMPLEfor Java'sjava.text.SimpleDateFormatto support legacy date parsing behavior inunix_timestampfunction. And i think we need to add a flag to indicate whichDateTimeFormatterTypeto use in Sparkunix_timestampfunction. How do you think?
I'm in favor of this approach too, because there are conflicting parts between the two formats. cc @rui-mo, @NEUpanning
I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat
@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?
I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat
@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?
from_unixtime, to_unix_timestamp
We encountered more result mismatch issues due to Velox joda Dateformatter not aligning with Spark SimpleDateFormat. I did some research on Spark SimpleDateFormat. And I updated the issue description to include the main difference between Spark SimpleDateFormat and Velox Joda and the tasks to resolve this issue. cc @rui-mo @mbasmanova @kecookier @pedroerp