cudf [BUG] Concat `Index` behavior diverts from `pandas`

As @wence- points out here https://github.com/rapidsai/cudf/pull/15623#discussion_r1586054947

We allow concatenating indexes whilst pandas does not.

Do we like this difference in behavior? Do we want parity? Would users be upset if we changed this behavior?

May 04 '24 04:05 er-eis

I don't think we want to support concatenating indexes, but I think cudf does rely on this behavior internally. I have a vague recollection of trying to remove this at one point and realizing that there were dependent changes required to make that happen. It looks like you've started adding deprecations for this in #15650, which is a great way to check this. Since our test suite runs with warnings as errors, if there are internal usages of concatenating indexes we'll see them as test failures there. Let's see how that goes!

May 07 '24 23:05 vyasr

@vyasr assuming we have reasonable test coverage, based on my PR, the areas with dependent code are related to:

cudf Indexes (not a big shocker there)
cudf parquets
dask_cudf parquets
dask_cudf accessor
dask_cudf core

May 07 '24 23:05 er-eis

OK awesome. It looks like you've started trying to catch those warnings in #15650, which is a good way to make sure that you've tracked down all the cases. However, I don't think we want to merge that exactly as is, because if we do then our users will start seeing a lot of warnings from APIs that they can't do anything about. We do want to verify that cudf.concat is itself going to throw the warning when users directly use it in a way that will be unsupported in the future, but we want to make sure that users don't see warnings due to our own code internally using deprecated functionality. We should start working through those APIs to see how best to rewrite them so that they don't rely on the deprecated functionality anymore. Does that make sense? Feel free to ask questions here (or on the PR)!

May 08 '24 23:05 vyasr

@vyasr that makes sense, i'll set the PR to draft mode until i can work through 2, 3, 4, & 5 listed above (the only failing unit test for 1 directly calls cudf.concat, so i think we don't need an investigation there)

May 09 '24 03:05 er-eis

Thanks! Feel free to let us know if you run into any issues and need some help.

May 15 '24 23:05 vyasr

@vyasr & cc @wence-

dask expects to be able to concat Index objects: https://github.com/dask/dask/blob/main/dask/dataframe/dispatch.py#L50

cudf calls dask, which then calls the concat in cudf via this definition: https://github.com/rapidsai/cudf/blob/branch-24.06/python/dask_cudf/dask_cudf/backends.py#L276

based on the above, it seems to me that we need a function which supports the concatenation of cudf.BaseIndex objects. if you agree this is the case, then i think we shouldn't deprecate this functionality. some possibilities that come to mind (i am missing context here):

move the concatenation of cudf.BaseIndex to a function other than reshape's concat. make the func specific to cudf.BaseIndex. make a new backends func to handle this.
leave the functionality in reshape's concat, accept we diverge from pandas functionality for concat
remove the associated calls to dask so we don't need to worry about what they're doing. this seems like something that wouldn't be worth our time
convince the dask team to deprecate their concatenation? this seems like something not worth pursuing.

what do you think?

May 19 '24 13:05 er-eis

Good analysis! You've found most of the relevant places, but what we ought to also look at is how dask implements concatenation for pandas indexes. It looks dask uses append for pandas indexes. You can see the implementation in dask for pandas dataframes here. The append method was removed for Series and DataFrame in pandas 2.0, but it still exists for indexes. So what we probably want to do is update the dispatch in the dask_cudf backend that you linked to to use the index append method.

In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state. However, any such changes in pandas would be a longer-term thing, so we should be able to proceed with the above plan in the interim.

May 20 '24 22:05 vyasr

thanks so much for the context, @vyasr! i'll plan to proceed with 1 via the usage of index append -- to me, the implementation in dask pandas is ugly. i think it'd be easier to follow if we separated indices from other objects

May 21 '24 01:05 er-eis

In the long run, we might want to discuss with pandas developers whether there should be more uniformity between the different types and whether they use append or concat. @mroeschke might know how and why pandas ended up in the current heterogeneous state.

IIRC pandas.Series.append and pandas.DataFrame.append were deprecated because similar and more performance functionality existed in pandas.concat; however, pandas.concat never worked on pandas.Index objects that why .append exists on pandas.Index.

I'm not sure if pandas.concat will ever support pandas.Index objects though since it's used for "concat pandas objects that have axes"

May 21 '24 18:05 mroeschke

Hmm OK yeah I guess that sort of makes sense. I really wish there were fewer ways in which we needed to special-case for indexes.

May 21 '24 23:05 vyasr

Hi @er-eis have you had a chance to work on this (or any of the other issues for which you opened PRs)?

Jul 19 '24 22:07 vyasr

@vyasr, i'll try to carve some time out next week to work on this. been slammed with other responsibilities lately

Jul 21 '24 04:07 er-eis

No problem, just checking in. Thanks for the update!

Jul 22 '24 16:07 vyasr

things keep popping up and i'm about to start a project that will last at least 6 months, so if someone else has time in the next 6 months they should wrap this up

Aug 03 '24 15:08 er-eis

cudf cudf copied to clipboard

[BUG] Concat `Index` behavior diverts from `pandas`

cudf
cudf copied to clipboard