velox icon indicating copy to clipboard operation
velox copied to clipboard

Velox doesn't support legacy date format behavior in Spark SQL

Open NEUpanning opened this issue 1 year ago • 6 comments
trafficstars

Bug description

For backward capability of parsing/formatting of timestamp/date strings expression behavior and adoption of new behaviors. Spark add a setting spark.sql.legacy.timeParserPolicy to indicate whether using legacy behavior. If it is set to LEGACY, Spark doesn't performs strict checking for expression's input. I see Velox doesn't support this when using unix_timestamp function, there is a related gluten issue : link. Maybe there are more result mismatch cases.

NEUpanning avatar Jul 01 '24 04:07 NEUpanning

Maybe we should add support for legacy date format behavior. @pedroerp @rui-mo What do you think? Thanks!

NEUpanning avatar Jul 01 '24 04:07 NEUpanning

@NEUpanning, is Spark's legacy policy widely used in your production environment?

PHILO-HE avatar Jul 01 '24 07:07 PHILO-HE

@PHILO-HE Yes. In Spark upgrades, we need to use legacy policy for backward compatibility rather than pushing users to accept the new behavior. Here is part of legacy policies we are using :

spark.sql.storeAssignmentPolicy Legacy
spark.sql.legacy.createArrayOneBasedIndex false
spark.sql.legacy.setCommandRejectsSparkCoreConfs false
spark.sql.legacy.ctePrecedencePolicy LEGACY
spark.sql.legacy.createEmptyCollectionUsingStringType true
spark.sql.legacy.exponentLiteralAsDecimal.enabled true
spark.sql.legacy.parser.havingWithoutGroupByAsWhere true
spark.sql.legacy.timeParserPolicy LEGACY
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue true
spark.sql.legacy.fromDayTimeString.enabled true
spark.sql.legacy.allowHashOnMapType true
spark.sql.legacy.allowNegativeScaleOfDecimal true
spark.sql.legacy.followThreeValuedLogicInArrayExists false
spark.sql.legacy.json.allowEmptyString.enabled true
spark.sql.legacy.setopsPrecedence.enabled true
spark.sql.legacy.literal.pickMinimumPrecision false
spark.sql.legacy.sizeOfNull true
spark.sql.legacy.typeCoercion.datetimeToString.enabled true
spark.sql.legacy.doLooseUpcast true
spark.sql.legacy.addMonthsAdjustLastDay true
spark.sql.legacy.statisticalAggregate true

NEUpanning avatar Jul 01 '24 08:07 NEUpanning

@NEUpanning: we have recently added support for different date/timestamp parsing semantics; are these not enough to follow the Spark parsing semantic you described? If not, could you highlight some of the differences?

https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.h#L48

Cc: @mbasmanova

pedroerp avatar Jul 01 '24 15:07 pedroerp

@pedroerp Thanks for your reply. UnixTimestampParseFunction is using DateTimeFormatter to parse date : source code link. DateTimeFormatter doesn't support different date/timestamp parsing semantics like TimestampConversion.

NEUpanning avatar Jul 02 '24 02:07 NEUpanning

@kecookier

NEUpanning avatar Jul 05 '24 11:07 NEUpanning

@NEUpanning DateTimeFormatter has different parsing "types". My point was that, even if none of these types support the exact parsing semantics you need today, they could be extended to do so:

https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/DateTimeFormatter.h#L26

pedroerp avatar Jul 15 '24 16:07 pedroerp

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

NEUpanning avatar Jul 16 '24 03:07 NEUpanning

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

pedroerp avatar Jul 16 '24 03:07 pedroerp

@pedroerp Here is an example I wrote from scratch.

code :

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;import java.util.Date;

public class Test {
    static SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
    static DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
    public static void invalidMonth() {
        // Invalid month (13)
        String dateString = "2023-13-01";

        // SimpleDateFormat can parse the invalid date
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time will throw exception
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void dateWithExtraCharacters() {
        // Date with extra Characters
        String dateString = "2023-12-01 00:00:00";

        // SimpleDateFormat does not use all characters up to the end of the string
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time consumes all characters
        DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws ParseException {
        invalidMonth();
        dateWithExtraCharacters();
    }
}

output :

SimpleDateFormat parsed date: Mon Jan 01 00:00:00 CST 2024
Joda-Time: Cannot parse "2023-13-01": Value 13 for monthOfYear must be in the range [1,12]
SimpleDateFormat parsed date: Fri Dec 01 00:00:00 CST 2023
Joda-Time: Invalid format: "2023-12-01 00:00:00" is malformed at " 00:00:00"

NEUpanning avatar Jul 16 '24 04:07 NEUpanning

@pedroerp I'm implementing new DateTimeFormatterType and i see DateTimeFormatSpecifier doesn't support Week in month pattern used by java.text.SimpleDateFormat. Should i implement it in another PR or in the PR that addresses this issue?

NEUpanning avatar Jul 18 '24 12:07 NEUpanning

Should i implement it in another PR

@NEUpanning A separate PR would be nice. Smaller PRs are easier to review. Thanks.

mbasmanova avatar Jul 18 '24 12:07 mbasmanova

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

Here is the difference between joda and SimpleDateFormat:

Field (SimpleDateFormat) Field (Joda) Desc Velox Support
  C century of era (>=0)  
F   Day of week in month Unsupported
L M Month in year (standalone form)  
M Month in year (context sensitive) Unsupported
W   Week in month https://github.com/facebookincubator/velox/pull/10511
  Y year of era (>=0)  
Y x Week year Unsupported
Z Z Time zone Unsupported
u e day of week

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

I'm in favor of this approach too, because there are conflicting parts between the two formats. cc @rui-mo, @NEUpanning

ccat3z avatar Sep 04 '24 02:09 ccat3z

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

rui-mo avatar Sep 04 '24 03:09 rui-mo

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

from_unixtime, to_unix_timestamp

ccat3z avatar Sep 04 '24 04:09 ccat3z

We encountered more result mismatch issues due to Velox joda Dateformatter not aligning with Spark SimpleDateFormat. I did some research on Spark SimpleDateFormat. And I updated the issue description to include the main difference between Spark SimpleDateFormat and Velox Joda and the tasks to resolve this issue. cc @rui-mo @mbasmanova @kecookier @pedroerp

NEUpanning avatar Sep 09 '24 11:09 NEUpanning