cudf
Deprecate Arrow support in I/O
Description
Checklist
- [ ] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
I marked this as "breaking" (even though it's not really breaking), since it's a big deprecation and I think this makes it more prominent in the changelog.
Please relabel if I did this wrong, though :).
@vyasr
As promised, here's the follow-up deprecating Arrow file support in cudf. This is a pretty invasive PR, though, so I wonder if we should wait until we have a good alternative to pyarrow for I/O.
cc @rjzamora
Thanks for pushing on this, @lithomas1! I'll prioritize this for Monday.
> I wonder if we should wait until we have a good alternative to pyarrow for I/O.
Related Note: I started experimenting with a temporary alternative in a personal branch since use_python_file_object=False essentially destroys dask_cudf.read_parquet performance for large files. I didn't get the chance to flesh things out yet, but I'm hopeful that we can work out something reasonable.
Our current thinking regarding prioritization is that we're willing to eat the temporary disruptions for users in 24.10 if we get rid of Arrow before we have a good alternative for remote I/O. We definitely want to continue researching and planning for alternatives, but the tradeoffs are worthwhile at this point.
@rjzamora are you good with the deprecation moving forward in general? If it turns out to be absolutely critical for some reason we can always delay the removal when 24.10 rolls around, but for now I'd like us to operate under the assumption that removal is happening in 24.10.
> @rjzamora are you good with the deprecation moving forward in general? If it turns out to be absolutely critical for some reason we can always delay the removal when 24.10 rolls around, but for now I'd like us to operate under the assumption that removal is happening in 24.10.
Yes. We should deprecate anything depending on pyarrow for 24.08, so that we can remove it in 24.10. I am on board with that.
I support this PR moving forward. However, a few important notes about logistics:
- [EDIT: I'm probably mistaken about this - I now believe dask-cudf relies on the default cudf behavior. I was remembering an older variation on the logic (reviewing now)] ~I would consider 24.08 to be blocked until the default `dask_cudf.read_parquet` behavior does not result in a deprecation warning. Unless I'm mistaken, the current form of this PR will result in a lot of unnecessary noise in dask-cudf (even when the user isn't asking to use pyarrow). My preference is to modify the dask-cudf default before (or within) this PR. (I can submit the dask-cudf component if it's helpful).~
- I would consider partial-IO support to be a P0 for 24.08 - Downstream libraries will feel considerable pain if we need to duplicate a 10GB file 10 times (something that does currently happen when `use_python_file_object=False`). I was hoping to introduce the simple/temporary pyarrow alternative before deprecating `use_python_file_object`, but I don't think the ordering is critical. I will likely push a WIP for the workaround today or tomorrow either way (I'll be happy to adjust/coordinate with the changes in this PR).
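To illustrate the partial-IO concern, here is a minimal sketch of what "read only the bytes each task needs" means, as opposed to every task copying the whole file. It uses only the Python standard library on a local temporary file; `read_byte_range` is a hypothetical helper, not a cudf or dask-cudf API, and a real Parquet reader would derive the byte ranges from the file footer and column-chunk metadata rather than splitting the file evenly.

```python
import os
import tempfile


def read_byte_range(path, offset, length):
    """Hypothetical helper: read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)  # jump to the start of the requested range
        return f.read(length)  # read only that range, not the whole file


# Build a small stand-in for a large file (100 bytes instead of 10 GB).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    demo_path = tmp.name

# Assumption for illustration: 10 tasks, each owning an equal slice.
n_tasks = 10
chunk = os.path.getsize(demo_path) // n_tasks
parts = [read_byte_range(demo_path, i * chunk, chunk) for i in range(n_tasks)]

# Total bytes transferred equals the file size, not n_tasks * file size.
total_bytes_read = sum(len(p) for p in parts)
os.remove(demo_path)
```

With ranged reads the ten tasks together move one file's worth of bytes; if each task instead had to fetch the full file first, the transfer would be ten times larger.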
OK great, it sounds like we are unblocked for 24.08 with the caveat that we need to provide an alternative path for use_python_file_object before we can freeze. I'm fine moving in either order as long as you feel good about being able to hit that release target.
Addressed most of the comments here.
Will update this PR with deprecation for open_file_options sometime later.
I deprecated open_file_options which I think is the last remaining API to be deprecated.
This should be ready for a re-review now.
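For readers unfamiliar with how a keyword deprecation like this is typically wired up, the following is a minimal sketch and not the code in this PR. `read_table_sketch` and the `_UNSET` sentinel are hypothetical names; the only real API used is the standard library's `warnings.warn`. The sentinel is what lets default calls stay silent, which is the property discussed above for `dask_cudf.read_parquet`.

```python
import warnings

# Sentinel so we can tell "argument not passed" apart from an explicit value.
_UNSET = object()


def read_table_sketch(path, open_file_options=_UNSET, use_python_file_object=_UNSET):
    """Hypothetical reader; only the deprecation handling is illustrated."""
    deprecated = (
        ("open_file_options", open_file_options),
        ("use_python_file_object", use_python_file_object),
    )
    for name, value in deprecated:
        # Warn only when the caller explicitly passed the deprecated keyword.
        if value is not _UNSET:
            warnings.warn(
                f"{name} is deprecated and will be removed in a future release.",
                FutureWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
    return path  # placeholder for the actual read
```

Calling `read_table_sketch("data.parquet")` emits nothing, while `read_table_sketch("data.parquet", use_python_file_object=False)` emits one `FutureWarning`.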
@rjzamora @wence-
Any other comments here?
@wence- Will you have time to take another look at this?
/merge