cudf
cudf copied to clipboard
WIP: Improve cudf::io::datasource::create().
This commit introduces the new datasource_kind and datasource_params data types, and updates the cudf::io::datasource::create() signature to allow parameterized datasource creation.
N.B. Non-compiling checkpoint commit.
Description
Checklist
- [ ] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
@tpn Would you be able to avoid force-pushing here? It has a tendency to mess up review comments and we generally avoid it. We always squash on merge, so it is okay to have a messy intermediate history.
@bdice oh interesting, didn't know that's your preferred style for cudf. Sure, I'll switch to a merge workflow going forward.
Apologies for the slow response; we've been thoroughly reevaluating our IO options, in part prompted by your PR :)
We're planning to drastically change the available IO options to avoid redundancy and simplify the behavior.
What we're aiming to end up with:
kvikIO: Include three levels on compatibility mode: force off, allow as fallback, force on. This gives us parity with current options, with much less confusion. It does require changes on kvikIO side so we can't expose this ATM. We also need to change file_source to use kvikIO for host reads, thus removing the internal implementation.
memory map: instead of direct file reads, memory map and copy from the buffer (both host and device reads). Current memory mapping only applies to host reads, so some changes are needed here as well. If kvikIO adds support for memory mapping, this potentially affects the wording of the options as well
odirect: as implemented in this PR, only with fallback to kvikIO in compatibility mode. This way we don't re-implement "vanilla" file reads and can outsource that to kvikIO.
I think under this proposal, if the performance numbers work out, we would end up with a base file_source class that only uses kvikIO, and a derived odirect_source class. Until memory mapping is not added to kvikIO, we would also keep the derived memory_mapped_source class.
Do you see any issues with this approach?
I'm writing all of this because, as far as I can we see, the expanded API is blocked on some of these changes; we'll change available options to a large extent, so exposing them now would imply breaking changes to create later. I believe work on odirect_source is not blocked, but it should probably derive from (soon to be reworked) file_source.
@vuule that all sounds great. I anticipated that the PR would be far more likely to spur I/O model changes like you've described, versus just getting approved and merged as-is, so no worries there. Regarding keeping odirect, as long as it's exposed in such a way that it can be used by downstream cudf users (i.e. include it in a public header file), that's really all that would be needed.
It sounds like it would make more sense to close this PR for now and do new ones for the upcoming work. I can take a stab at an odirect-only based one if that's easiest.
closing as the datasource code has changed quite a bit and we're not even quite done with changes to enable this feature.