ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: Add to_date and to_time to StringValue

Open saschahofmann opened this issue 1 year ago • 15 comments

Is your feature request related to a problem?

No response

Describe the solution you'd like

I think it would make sense to have date and time parsing functions similar to the already existing to_timestamp function.

A workaround right now could be to use to_timestamp and extract the time or date part but e.g. BigQuery supports extra parsing functions PARSE_TIME and PARSE_DATE

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

BigQuery, Postgres

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

  • [x] to_date #9030

  • [ ] to_time

saschahofmann avatar Apr 06 '24 10:04 saschahofmann

@saschahofmann Your request makes sense! I'll look to implement this, with to_date and to_time falling back to to_timestamp followed by extracting the appropriate component for backends that don't support it directly.

deepyaman avatar Apr 10 '24 22:04 deepyaman

Cool. Also to happy to look into it but I would need some pointers on where these things are implemented.

saschahofmann avatar Apr 11 '24 08:04 saschahofmann

Cool. Also to happy to look into it but I would need some pointers on where these things are implemented.

Amazing!

So, to_timestamp is defined here: https://github.com/ibis-project/ibis/blob/356e4593f21bf662f5e55f307694a616c712eeec/ibis/expr/types/strings.py#L1275-L1302

You would need to add functions for to_date and to_time to the API. I'd highly recommend just starting with one; probably to_date will run into less issues (e.g. with timezones).

You'll also need to add a corresponding operator: https://github.com/ibis-project/ibis/blob/356e4593f21bf662f5e55f307694a616c712eeec/ibis/expr/operations/temporal.py#L74

Finally, you should add tests, like https://github.com/ibis-project/ibis/blob/main/ibis/backends/tests/test_temporal.py#L1431

Once you've done this, you implement translations. For example:

  • Base SQLGlot translation: https://github.com/ibis-project/ibis/blob/356e4593f21bf662f5e55f307694a616c712eeec/ibis/backends/sql/compiler.py#L281
  • BigQuery override, if necessary: https://github.com/ibis-project/ibis/blob/356e4593f21bf662f5e55f307694a616c712eeec/ibis/backends/bigquery/compiler.py#L247
  • A bunch of other overrides, as necessary: https://github.com/search?q=repo%3Aibis-project%2Fibis%20stringtotimestamp&type=code

Since you're defining a new operator, it would require implementation across a lot of places in the codebase; let me know if this feels like something you want to tackle, first time diving into the codebase. Regardless, start small! You can fail the test for everything except BigQuery to start with, if you'd like. Also, happy to structure this in a way to set up some of the bases and you can focus on adding implementations for the backends you prefer; just let me know.

deepyaman avatar Apr 11 '24 13:04 deepyaman

Oh, also; I noticed that .to_timestamp().time() will not work across the board, because DuckDB doesn't support casting timestamp with timezone to time. So may need to experiment and find a nice approach. Part of why I think it's better to start with to_date. :)

deepyaman avatar Apr 11 '24 13:04 deepyaman

I guess in that case, it is acceptable to first drop the timezone?

saschahofmann avatar Apr 11 '24 15:04 saschahofmann

I guess in that case, it is acceptable to first drop the timezone?

In general, I think we defer to the behavior of individual backends to some extent on these things. In the case of DuckDB, my understanding is they don't believe this to be a safe operation, and I'm somewhat inclined to agree. I'll also defer to @cpcloud, given he has a much clearer view on how Ibis usually treats these kinds of things.

deepyaman avatar Apr 11 '24 17:04 deepyaman

Alright. I finally started on adding to_date and already got a question on the SQLGlot implementation

  • Base SQLGlot translation: https://github.com/ibis-project/ibis/blob/356e4593f21bf662f5e55f307694a616c712eeec/ibis/backends/sql/compiler.py#L281

I assume str_to_time is a SQLGlot function? Maybe this https://sqlglot.com/sqlglot/dialects/dialect.html#str_to_time_sql?

I don't see an equivalent for str_to_date. As we discussed the fallback could be to do to_timestamp().date(). How/where would I put that?

saschahofmann avatar Apr 19 '24 09:04 saschahofmann

I also took a look at bigquery/compiler.py but I have no idea whether something like this would just work.

def visit_StringToDate(self, op, *, arg, format_str):
    return self.f.parse_date(format_str, arg)

or whether I can just add parse_date to the SIMPLE_OPS?

saschahofmann avatar Apr 19 '24 09:04 saschahofmann

@saschahofmann -- if arg and format_str are the args defined on the StringToDate op, then yes, your first example will work.

and in this case, since all the kwargs defined on the op are used in parse_date, you can just add it to SIMPLE_OPS

gforsyth avatar Apr 19 '24 14:04 gforsyth

But do I need to both? Or is one of them enough?

saschahofmann avatar Apr 19 '24 15:04 saschahofmann

Just the one is sufficient

gforsyth avatar Apr 19 '24 16:04 gforsyth

Alright I'll finally have time to tackle the second part of this and add a to_time function if wanted. @deepyaman mentioned that DuckDB doesn't support this on purpose and that @cpcloud would have an opinion on that. Do you want to support to_time and default back to just to_timestamp().time? Or should backends like DuckDB throw an error?

saschahofmann avatar Jun 02 '24 07:06 saschahofmann

Side note - if we're adding new conversion API methods, I think we might want to use an as_* prefix (or something else), keeping to_* methods for things that execute like to_pandas. See #9788 for more context.

jcrist avatar Aug 07 '24 17:08 jcrist

Alright I'll finally have time to tackle the second part of this and add a to_time function if wanted. @deepyaman mentioned that DuckDB doesn't support this on purpose and that @cpcloud would have an opinion on that. Do you want to support to_time and default back to just to_timestamp().time? Or should backends like DuckDB throw an error?

@saschahofmann Very sorry for not getting back to you sooner on this. Where possible, it seems it would make sense to use the time-specific parsing function (e.g. PARSE_TIME on BigQuery, TO_TIME on Snowflake). For DuckDB, you can just cast (e.g. CAST('1992-09-20 11:30:00.123456' AS TIME)).

deepyaman avatar Aug 07 '24 18:08 deepyaman

Alright! I will try to get to this soon!

saschahofmann avatar Aug 13 '24 14:08 saschahofmann