arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17425: [R] `lubridate::as_datetime()` in dplyr query should be able to handle time in sub seconds

Open eitsupi opened this issue 3 years ago • 3 comments

This change allows strings containing sub-seconds and double types to be used as input to lubridate::as_datetime().

1.5 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    y = lubridate::as_datetime(x)
  ) |>
  dplyr::collect() |>
  dplyr::mutate(
    z = lubridate::as_datetime(x),
    is_equal = (y == z)
  )
#>     x                   y                   z is_equal
#> 1 1.5 1970-01-01 00:00:01 1970-01-01 00:00:01     TRUE

And, because the timestamp type generated by as_datetime is expected to be used in combination with other functions, fix the bug of as.Date and lubridate::as_date that could cause an error if a sub-seconds timestamp was entered.

As a breaking change, the return type of as_datetime() will be nanoseconds, but I hope this will not have a major impact, since originally as_datetime() |> as.integer() or as_datetime() |> as.numeric() could not be used because it would try to cast to int32 or double, resulting in an error. (We have to cast timestamp to int64)

arrow 9.0.0

1 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    x = lubridate::as_datetime(x),
    y = cast(x, arrow::int64())
  ) |>
  dplyr::collect()
#>                     x y
#> 1 1970-01-01 00:00:01 1

This PR

1 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    x = lubridate::as_datetime(x),
    y = cast(x, arrow::int64())
  ) |>
  dplyr::collect()
#>                     x          y
#> 1 1970-01-01 00:00:01 1000000000

eitsupi avatar Aug 16 '22 04:08 eitsupi

https://issues.apache.org/jira/browse/ARROW-17425

github-actions[bot] avatar Aug 16 '22 04:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 16 '22 04:08 github-actions[bot]

since originally as_datetime() |> as.integer() or as_datetime() |> as.numeric() could not be used because it would try to cast to int32 or double, resulting in an error.

Maybe we have to add these test cases (and JIRA issues for implementing these ARROW-17428).

eitsupi avatar Aug 16 '22 08:08 eitsupi

This PR seems to have stopped working because of #14553. I don't know what the problem is.

eitsupi avatar Dec 15 '22 14:12 eitsupi

Additional apologies for letting this one slip by our PR review triage! I will look at this tomorrow 🙂

paleolimbot avatar Jan 05 '23 20:01 paleolimbot

If I'm reading this properly, as.integer() on a timestamp will error? If that's the case, this wouldn't be a breaking change? In any case silently ignoring fractional seconds seems worse?

I took a quick skim and I don't see any problems off the top of my head. There might possibly be a performance hit from the added complexity of the function but my preference is always safety >>> speed (until somebody complains).

paleolimbot avatar Jan 10 '23 17:01 paleolimbot

@paleolimbot Nah, what I mean is that casting a datetime to an integer via cast() will return the integer value in nanoseconds rather than seconds.

thisisnic avatar Jan 10 '23 18:01 thisisnic

I think that we always want cast() to do whatever Arrow compute does, and in this case, the cast() just drops the metadata (true for most of the datetime types I know about). That property is useful for doing things like this that require that value.

paleolimbot avatar Jan 10 '23 19:01 paleolimbot

I think that we always want cast() to do whatever Arrow compute does, and in this case, the cast() just drops the metadata (true for most of the datetime types I know about). That property is useful for doing things like this that require that value.

OK, makes sense. In that case then this looks pretty much good to go, will just give it a final look over.

thisisnic avatar Jan 10 '23 20:01 thisisnic

If I'm reading this properly, as.integer() on a timestamp will error?

Yes, this needs to be implemented in the future. (#20369)

eitsupi avatar Jan 11 '23 11:01 eitsupi

Benchmark runs are scheduled for baseline = 872bbda4eb2e75690f0c756c616c171ad6615739 and contender = 157062d4b3a8709555ce4abae10c554e4df05cca. 157062d4b3a8709555ce4abae10c554e4df05cca is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Finished :arrow_down:0.69% :arrow_up:0.0%] test-mac-arm [Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.5% :arrow_up:0.12%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 157062d4 ec2-t3-xlarge-us-east-2 [Finished] 157062d4 test-mac-arm [Finished] 157062d4 ursa-i9-9960x [Finished] 157062d4 ursa-thinkcentre-m75q [Finished] 872bbda4 ec2-t3-xlarge-us-east-2 [Finished] 872bbda4 test-mac-arm [Finished] 872bbda4 ursa-i9-9960x [Finished] 872bbda4 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Jan 11 '23 18:01 ursabot