Parse (non-MultiIndex) label-based keys to structured data
Description
Following on from #13534, this extends the scheme to handle label-based lookups as long as the index is not a multiindex.
As is the case for positional indexing, all of the different ways one can index a frame with labels eventually boil down to indexing by slice, boolean mask, map, or scalar in libcudf. loc-based keys are parsed into information that tags them by type. Since this information is the same as is used for iloc indexing, we can then dispatch to the same "internal" calls that don't do further bounds-checking or normalisation: rather than converting a label-based lookup to an argument we can pass to iloc-getitem (which must reinterpret it), we just take the decision straight away.
The next stage (which will help to remove a bunch of code) will be to handle multiindex keys, but that will be sufficiently complicated that I'd rather do it separately.
- Closes #13014
- Closes #13268
- Closes #13379
- Closes #13652
- Closes #13653
- Closes #13658 (or at least, maintains the correctness)
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Explicitly requested the same set of reviewers as #13534.
On the one hand, I would have liked to handle multiindex lookups as well in one change, but unfortunately I think the rules are too complicated and I'm still waiting for clarification on a bunch of ambiguities in the way pandas handles things. As a consequence this doesn't remove as much code as one would have hoped.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
/ok to test
@wence- did this stall out just waiting for reviews? If so, I'm very sorry I lost track of it. Please bring it up to date and I'm happy to review this ASAP (probably for 24.08 not 24.06 at this point though).
@wence- did this stall out just waiting for reviews? If so, I'm very sorry I lost track of it. Please bring it up to date and I'm happy to review this ASAP (probably for 24.08 not 24.06 at this point though).
Sorry, it got rather struck on my end, I will try and resurrect for 24.10
I am loathe to close this PR outright given how many issues that it fixes, but I also fear that it is woefully out of sync with the current state of the code base given how much @mroeschke has refactored the underlying code. Also, I doubt that @wence- will have time to return to this work any time soon given the various other major pushes that he is currently involved with.
@mroeschke and @wence- do you think it might make sense for there to be some knowledge transfer here (in the form of pair coding or simply discussion) to try and get this work over the finish line, perhaps in the form of multiple smaller and more easily digestible PRs?
Happy to try and do that, yes
Yeah I can try to digest @wence-'s change here and reimplement it amidst all the refactors. I can reach out to @wence- if I need any clarification.
Yeah I can try to digest @wence-'s change here and reimplement it amidst all the refactors. I can reach out to @wence- if I need any clarification.
Note I also flamed out a bit on the multiindex stuff because I couldn't understand what pandas was doing (and at least when I was looking, was ambiguous in some cases).
See, for reference:
- pandas-dev/pandas#53535
- pandas-dev/pandas#53613
- pandas-dev/pandas#53995
- pandas-dev/pandas#53996
- pandas-dev/pandas#54009
- pandas-dev/pandas#54019