ibis
ibis copied to clipboard
feat: support empty arrays, improve ibis.array() API
Picking out the array stuff from https://github.com/ibis-project/ibis/pull/8666. I think this is a better approach than https://github.com/ibis-project/ibis/pull/9458
Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones. By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction, we don't have to store an explicit dtype on ops.Array, and instead can just use rlz.highest_precedence_dtype(). And we we don't have to do fancy casting during compilation, all the elements will already have been casted as needed.
This allows for the compilation of arraycast [..] as STRUCT<...>[],
which postgres freaks about.
Instead, if we cast each individual element,
such as [cast({...} as ROW..., cast({...} as ROW...], then this is valid SQL.
I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time, and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly, so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.
ACTION NEEDED
Ibis follows the Conventional Commits specification for release automation.
The PR title and description are used as the merge commit message.
Please update your PR title and description to match the specification.
@cpcloud OK this is finally looking pretty good and is ready for a review!
I'm gonna be out of office for this week. So I hope when I get back I see that y'all will have figured it all out and everything is merged and happy 😁
@cpcloud whenever you get the chance, I responded to each of the individual comments, curious what you think. I agree it's sorta mungy, but I can't think of a cleaner way of supporting the API I want. I don't really want to cut anything from the API, but are you willing to? Eg not support ibis.array(None)?
Perhaps the larger question here is the tension between
- provide a high level API that is very flexible in what it takes in (None, python lists of python vals, python lists of ibis exprs, ibis array exprs)
- don't over guarantee anything that then we are going to have to continue supporting.
As is reasonable with our roles in ibis (user vs maintainer) It sounds like I care more about the 1st one, you about the 2nd haha 😂
One argument for favoring flexibility is consistency: I think we can both agree supporting list[ir.Value] is needed for ibis.array, .map,and .struct (they already do). but this sort of opens up Pandora's box: now these nested constructors are the only ibis constructor API that accepts ibis values. I want to bring the rest of the constructors in line with this. I think stopping with the scalar constructors would make sense, but also making ibis.memtable more generous with accepting ibis values could be a nice DX improvement, as I advocate for in some other issue.
I'm curious what @kszucs @gforsyth @jcrist think about this tension. Should we make things easier for users, or keep our API maintainable?
Part of the reason that pandas API is so large and complex to use and maintain (in my opinion) is that it favors flexibility over consistency. I think we should err on the side of consistency as a project, and provide flexibility where there's not a ton of additional complexity.
That means that sometimes the complexity of handling multiple input types gets pushed to the user and I would argue that that is the best place to handle non-trivial input normalization since the application has much better information about how it wants those inputs to look.
I am pretty against opening up all the ibis.* constructor APIs to accept other Ibis expressions as it adds a bunch of requirements and decision making around what's valid and how we validate the input beyond what we're already doing.
For nested types and tables, things get really complex as now you have to decide whether to accept mixed Ibis-expression-Python-value objects, and are we going to validate all of that?
Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?
Yeah I think this is probably a good idea. I'll trim it down as I can.
To confirm, I am assuming that array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]]) is a valid input. You also want to be able to support this, right?
To confirm, I am assuming that
array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]])is a valid input. You also want to be able to support this, right?
No, I'm saying we shouldn't support this. It opens the door to mixing ibis expressions and python types, and we'd have to unify both and it's a lot of additional complexity.
I currently have lots of code that looks like t.mutate(emails=ibis.array([t.email0, t.email1, t.email2]). Am I missing something here, how can I accomplish this goal?
@NickCrews I am still having a lot of trouble following everything here. I suspect that the PR is still trying to do too much.
Can you provide a minimum list of cases showing
- the current behavior
- the desired behavior
and avoid speculating on specific solutions on how to get the desired behavior? This problem space (complex type casting and inference) is really hairy, and it's very very difficult to review related PRs if they are trying to address a bunch of things at once.
Yes, I agree it is really hairy, and I think that would help. Maybe I start with a PR that just adds or changes tests to precisely show what the current behavior is
@NickCrews what's the status of this one? Is the plan to close it and start a new separate PR with a more targeted scope?