ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: improve map(), struct(), array()

Open NickCrews opened this issue 11 months ago • 6 comments

fixes https://github.com/ibis-project/ibis/issues/8289

This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases.

Changes:

This is adding support for passing in None to all these constructors. These use the new ibis.null(<type>) API to return op.Literal(None, <type>)s

Make these constructors idempotent: you can pass in existing Expressions into array(), etc. The type argument for all of these now always has an effect, not just when passing in python literals. So basically it acts like a cast.

A big structural change is that now ops.Array has an optional attribute "dtype", so if you pass in a 0-length sequence of values the op still knows what dtype it is.

Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup.

You can test this locally with eg pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns.

Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.

Also, implement ops.StructColumn on pandas and dask

Next steps

A possible nice follow-up would be to completely remove supporting structs, arrays, and maps inside ops.Literal. Currently, when someone calls ibis.array([1,2]), that now results in ops.Array, but if they do ibis.literal([1,2]), it is represented as ops.Literal. It would be nice if inside ibis.literal() there was some logic like

...
if isinstance(val, Mapping):
    return ibis.map(val)
if is_iterable(val):
    return ibis.array(val)
...

and then all the backends wouldn't have to even worry about supporting these nested types in visit_Literal()

NickCrews avatar Mar 15 '24 16:03 NickCrews

in my first attempt at https://github.com/ibis-project/ibis/pull/8649, I went about it the wrong way, but I needed to ensure something was an Array, but the incoming thing might already have been an array.

NickCrews avatar Mar 15 '24 17:03 NickCrews

OK, as I added those tests I found some problems, and this snowballed into also being a fix for https://github.com/ibis-project/ibis/issues/8289. The implementations for both of these things are somewhat intertwined. It would be a pain to fix one, and then just change it again to fixx the other. So I just did everything at once.

NickCrews avatar Mar 15 '24 19:03 NickCrews

My goodness this gotten a bit out of hand.

EDIT: actually in the latest version it's not so bad. I didn't need to overhaul the internal representation after all.

OK I just pushed an absolute doozy of a commit in https://github.com/ibis-project/ibis/pull/8666/commits/530fad0e29d37b52880b2767f527d3abc8790fdb. Curious how many backends that fails on. It required really overhauling the internal representation of ops.Array and ops.Struct to be able to distinguish between

  • the overall value is NULL
  • the contained values is NULL Take a look at that commit description.

NickCrews avatar Mar 22 '24 04:03 NickCrews

Hi @NickCrews, checking in here. What's the status on this one? It looks like there are a lot of conflicts and red on CI 😅 .

ncclementi avatar Apr 30 '24 20:04 ncclementi

Yeeesssss this needs a bit of love :) Once my other pr with .cases lands I figured I would pick this up again.

NickCrews avatar May 01 '24 05:05 NickCrews

@cpcloud @jcrist this is ready for review whenever you get the chance. This is a bit of a doozy that might want to get split up into multiple commits/PRs. But I think it would be good to get a check that the overall direction is right before I do that? The summary in the top of this PR is up to date for a summary. Maybe read that and then it would be probably best to look at the tests, and then the factory methods in types/arrays.py, types/maps.py, and types/structs.py

NickCrews avatar May 21 '24 21:05 NickCrews

@NickCrews

Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns.

Also, fix executing Literal(None) on pandas and polars, 0-length arrays on polars

Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes.

Also, implement ops.StructColumn on pandas and dask

Please split these up into separate PRs.

Is

A big structural change is that now ops.Array has an optional attribute "dtype", so if you pass in a 0-length sequence of values the op still knows what dtype it is.

required for this PR?

Can't we handle this in separate PR? It seems like it's just to handle the empty array case.

cpcloud avatar May 31 '24 16:05 cpcloud

Please split these up into separate PRs.

Yeah will do :)

It seems like it's just to handle the empty array case.

It is. This is because currently empty arrays are represented by ops.Literals. This leads to needing to compile arrays inside visit_Literal. After this PR, ops.Array can handle empty arrays (ops.Struct and ops.Map already can I think), so we are then free to move towards removing all nested types from ops.Literal, and visit_Literal will only need to deal with non-nested values. What do you think, is this worth the churn?

NickCrews avatar May 31 '24 22:05 NickCrews

visit_Literal will only need to deal with non-nested values. What do you think, is this worth the churn?

I think this is worth exploring for sure. I'm not 100% sure it will be a net improvement without implementing it though :) It sounds like a nice improvement if it works out though.

cpcloud avatar Jun 01 '24 13:06 cpcloud

I'm picking off some of the changes in individual PRs, I'll let you know when this is ready for another look here.

NickCrews avatar Jun 04 '24 06:06 NickCrews

This PR is still too big. Can you split the array, map and struct implementations into three separate PRs?

cpcloud avatar Jun 13 '24 13:06 cpcloud

Yeah splitting it probably is good.

  • map is easy, no semantic changes going on there
  • for struct, I want to add tests for empty structs. Therefore I think we need to decide on and merge my other PR regarding this behavior
  • for arrays, it's gonna be a little hairy too, because I am implementing empty arrays at the same time as adjusting the API. I don't really want to do them separately or it will just be churn. Unless you see a decent way to address those two things separately

NickCrews avatar Jun 13 '24 14:06 NickCrews

@NickCrews Just checking in here, can you split these up some more?

cpcloud avatar Jun 27 '24 11:06 cpcloud

OK, I just tried to split the struct and the map stuff into their own PRs, but both of those rely on the new ibis.array(arg, typ) API. So we need to get the array stuff in first. I'll split that out.

NickCrews avatar Jun 27 '24 19:06 NickCrews

split this into

  • #9458
  • #9460
  • #9459

NickCrews avatar Jun 27 '24 19:06 NickCrews