cudf
cudf copied to clipboard
[BUG] Concat `Index` behavior diverts from `pandas`
As @wence- points out here https://github.com/rapidsai/cudf/pull/15623#discussion_r1586054947
We allow concatenating indexes whilst pandas
does not.
Do we like this difference in behavior? Do we want parity? Would users be upset if we changed this behavior?
I don't think we want to support concatenating indexes, but I think cudf does rely on this behavior internally. I have a vague recollection of trying to remove this at one point and realizing that there were dependent changes required to make that happen. It looks like you've started adding deprecations for this in #15650, which is a great way to check this. Since our test suite runs with warnings as errors, if there are internal usages of concatenating indexes we'll see them as test failures there. Let's see how that goes!
@vyasr assuming we have reasonable test coverage, based on my PR, the areas with dependent code are related to:
-
cudf
Index
es (not a big shocker there) -
cudf
parquets -
dask_cudf
parquets -
dask_cudf
accessor -
dask_cudf
core
OK awesome. It looks like you've started trying to catch those warnings in #15650, which is a good way to make sure that you've tracked down all the cases. However, I don't think we want to merge that exactly as is, because if we do then our users will start seeing a lot of warnings from APIs that they can't do anything about. We do want to verify that cudf.concat
is itself going to throw the warning when users directly use it in a way that will be unsupported in the future, but we want to make sure that users don't see warnings due to our own code internally using deprecated functionality. We should start working through those APIs to see how best to rewrite them so that they don't rely on the deprecated functionality anymore. Does that make sense? Feel free to ask questions here (or on the PR)!
@vyasr that makes sense, i'll set the PR to draft mode until i can work through 2, 3, 4, & 5 listed above (the only failing unit test for 1 directly calls cudf.concat
, so i think we don't need an investigation there)
Thanks! Feel free to let us know if you run into any issues and need some help.
@vyasr & cc @wence-
dask
expects to be able to concat
Index
objects: https://github.com/dask/dask/blob/main/dask/dataframe/dispatch.py#L50
cudf
calls dask
, which then calls the concat
in cudf
via this definition: https://github.com/rapidsai/cudf/blob/branch-24.06/python/dask_cudf/dask_cudf/backends.py#L276
based on the above, it seems to me that we need a function which supports the concatenation of cudf.BaseIndex
objects. if you agree this is the case, then i think we shouldn't deprecate this functionality. some possibilities that come to mind (i am missing context here):
- move the concatenation of
cudf.BaseIndex
to a function other thanreshape
'sconcat
. make the func specific tocudf.BaseIndex
. make a newbackends
func to handle this. - leave the functionality in
reshape
'sconcat
, accept we diverge frompandas
functionality forconcat
- remove the associated calls to
dask
so we don't need to worry about what they're doing. this seems like something that wouldn't be worth our time - convince the
dask
team to deprecate their concatenation? this seems like something not worth pursuing.
what do you think?
Good analysis! You've found most of the relevant places, but what we ought to also look at is how dask implements concatenation for pandas indexes. It looks dask uses append for pandas indexes. You can see the implementation in dask for pandas dataframes here. The append method was removed for Series and DataFrame in pandas 2.0, but it still exists for indexes. So what we probably want to do is update the dispatch in the dask_cudf backend that you linked to to use the index append method.
In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state. However, any such changes in pandas would be a longer-term thing, so we should be able to proceed with the above plan in the interim.
thanks so much for the context, @vyasr! i'll plan to proceed with 1 via the usage of index append -- to me, the implementation in dask pandas is ugly. i think it'd be easier to follow if we separated indices from other objects
In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state.
IIRC pandas.Series.append
and pandas.DataFrame.append
were deprecated because similar and more performance functionality existed in pandas.concat
; however, pandas.concat
never worked on pandas.Index
objects that why .append
exists on pandas.Index
.
I'm not sure if pandas.concat
will ever support pandas.Index
objects though since it's used for "concat pandas objects that have axes"
Hmm OK yeah I guess that sort of makes sense. I really wish there were fewer ways in which we needed to special-case for indexes.
Hi @er-eis have you had a chance to work on this (or any of the other issues for which you opened PRs)?
@vyasr, i'll try to carve some time out next week to work on this. been slammed with other responsibilities lately
No problem, just checking in. Thanks for the update!
things keep popping up and i'm about to start a project that will last at least 6 months, so if someone else has time in the next 6 months they should wrap this up