
Deprecate Arrow support in I/O

Open lithomas1 opened this issue 1 year ago • 3 comments

Description

Checklist

  • [ ] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

lithomas1 avatar Jun 28 '24 19:06 lithomas1

I marked this as "breaking" (even though it's not really breaking), since it's a big deprecation and I think this makes it more prominent in the changelog.

Please relabel if I did this wrong, though :).

lithomas1 avatar Jun 28 '24 23:06 lithomas1

@vyasr

As promised, here's the follow-up deprecating Arrow file support in cudf. This is a pretty invasive PR, though, so I wonder if we should wait until we have a good alternative to pyarrow for I/O.

cc @rjzamora

lithomas1 avatar Jun 29 '24 00:06 lithomas1

Thanks for pushing on this @lithomas1 ! I'll prioritize this for Monday.

> I wonder if we should wait until we have a good alternative to pyarrow for I/O.

Related note: I started experimenting with a temporary alternative in a personal branch since `use_python_file_object=False` essentially destroys `dask_cudf.read_parquet` performance for large files. I didn't get the chance to flesh things out yet, but I'm hopeful that we can work out something reasonable.

rjzamora avatar Jun 29 '24 00:06 rjzamora

Our current thinking regarding prioritization is that we're willing to eat the temporary disruptions for users in 24.10 if we get rid of Arrow before we have a good alternative for remote I/O. We definitely want to continue researching and planning for alternatives, but the tradeoffs are worthwhile at this point.

vyasr avatar Jul 01 '24 18:07 vyasr

@rjzamora are you good with the deprecation moving forward in general? If it turns out to be absolutely critical for some reason we can always delay the removal when 24.10 rolls around, but for now I'd like us to operate under the assumption that removal is happening in 24.10.

vyasr avatar Jul 02 '24 05:07 vyasr

> @rjzamora are you good with the deprecation moving forward in general? If it turns out to be absolutely critical for some reason we can always delay the removal when 24.10 rolls around, but for now I'd like us to operate under the assumption that removal is happening in 24.10.

Yes. We should deprecate anything depending on pyarrow for 24.08, so that we can remove it in 24.10. I am on board with that.

I support this PR moving forward. However, a few important notes about logistics:

  1. [EDIT: I'm probably mistaken about this - I now believe dask-cudf relies on the default cudf behavior. I was remembering an older variation on the logic (reviewing now)] ~I would consider 24.08 to be blocked until the default dask_cudf.read_parquet behavior does not result in a deprecation warning. Unless I'm mistaken, the current form of this PR will result in a lot of unnecessary noise in dask-cudf (even when the user isn't asking to use pyarrow). My preference is to modify the dask-cudf default before (or within) this PR. (I can submit the dask-cudf component if it's helpful).~
  2. I would consider partial I/O support to be a P0 for 24.08. Downstream libraries will feel considerable pain if we need to duplicate a 10GB file 10 times (something that currently happens when `use_python_file_object=False`). I was hoping to introduce the simple, temporary pyarrow alternative before deprecating `use_python_file_object`, but I don't think the ordering is critical. I will likely push a WIP for the workaround today or tomorrow either way (I'll be happy to adjust and coordinate with the changes in this PR).
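
To make the partial I/O concern in point 2 concrete, here is a minimal sketch using only the Python standard library. It is not cudf or dask-cudf code: `read_byte_range` and `read_whole_then_slice` are hypothetical helpers, and a small temporary file split into ten equal chunks stands in for a large Parquet file read by ten workers. It only compares how many bytes are loaded when each worker reads its own byte range versus when each worker loads the whole file first.

```python
import os
import tempfile


def read_byte_range(path, offset, length):
    """Read only `length` bytes starting at `offset` (hypothetical helper)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_whole_then_slice(path, offset, length):
    """Load the entire file, then slice out the part that is needed."""
    with open(path, "rb") as f:
        data = f.read()
    return data[offset:offset + length], len(data)


# Build a small stand-in for a large file made of 10 equal chunks.
chunk_size = 1024
n_chunks = 10
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(n_chunks):
        tmp.write(bytes([i]) * chunk_size)
    path = tmp.name

# Simulate 10 workers, each of which needs exactly one chunk.
partial_bytes = 0  # total bytes loaded with byte-range reads
full_bytes = 0     # total bytes loaded when every worker reads the whole file
for i in range(n_chunks):
    part = read_byte_range(path, i * chunk_size, chunk_size)
    sliced, loaded = read_whole_then_slice(path, i * chunk_size, chunk_size)
    assert part == sliced  # both approaches yield the same data
    partial_bytes += len(part)
    full_bytes += loaded

os.remove(path)
print(partial_bytes, full_bytes)
```

In this toy setup the whole-file approach loads ten times as many bytes in total, which mirrors the "10GB file 10 times" scenario described above.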

rjzamora avatar Jul 02 '24 13:07 rjzamora

OK great, it sounds like we are unblocked for 24.08 with the caveat that we need to provide an alternative path for use_python_file_object before we can freeze. I'm fine moving in either order as long as you feel good about being able to hit that release target.

vyasr avatar Jul 02 '24 19:07 vyasr

Addressed most of the comments here.

I will update this PR with a deprecation for `open_file_options` later.

lithomas1 avatar Jul 05 '24 17:07 lithomas1

I deprecated `open_file_options`, which I think is the last remaining API to be deprecated.

This should be ready for a re-review now.
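
For readers unfamiliar with the pattern under discussion, the following is a minimal sketch of one common way to deprecate keyword arguments so that a warning fires only when a caller passes them explicitly, which is the property the earlier dask-cudf concern was about. This is an assumption-laden illustration, not the implementation in this PR: `read_sketch` and the `_UNSET` sentinel are hypothetical, and only the argument names `open_file_options` and `use_python_file_object` come from the thread.

```python
import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit value


def read_sketch(path, open_file_options=_UNSET, use_python_file_object=_UNSET):
    """Hypothetical reader showing the warn-on-explicit-use pattern."""
    for name, value in (
        ("open_file_options", open_file_options),
        ("use_python_file_object", use_python_file_object),
    ):
        if value is not _UNSET:
            warnings.warn(
                f"{name} is deprecated and will be removed in a future release.",
                FutureWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
    return path  # placeholder for the actual read


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_sketch("data.parquet")  # defaults only: no warning
    n_default = len(caught)
    read_sketch("data.parquet", use_python_file_object=False)  # explicit: warns
    n_explicit = len(caught) - n_default
```

With a sentinel default, downstream callers that never touch the deprecated arguments see no warnings, while explicit users get a `FutureWarning` ahead of the removal.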

lithomas1 avatar Jul 09 '24 17:07 lithomas1

@rjzamora @wence-

Any other comments here?

lithomas1 avatar Jul 12 '24 15:07 lithomas1

@wence- Will you have time to take another look at this?

rjzamora avatar Jul 17 '24 18:07 rjzamora

/merge

lithomas1 avatar Jul 19 '24 17:07 lithomas1