ARROW-17313: [C++][Python] Add fragment.slice() for CSV file format
This is going to be a draft PR based on discussions in https://issues.apache.org/jira/browse/ARROW-17313. This PR addresses the CSV format. The Parquet format I believe is covered by already implemented API "subset".
The main theme is we want to be able to specify what portion of a FileFragment we want to read when user does .to_batches() in Python or ScanBatchesAsync() in C++.
The implementation here adds two fields start_byte_ and end_byte_ to the FileFragment class which can get set by using .slice() in Python or calling the setters in C++.
I make sure that the slice() and set_bounds() methods in the Python ParquetFileFormat and C++ respectively do the right thing due to inheritance.
This is meant to be a draft. Happy to update the interface if something else makes more sense.
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
https://issues.apache.org/jira/browse/ARROW-17313
https://github.com/apache/arrow/pull/13688 is a very similar PR for Parquet that takes a different implementation approach, I think it's worth coordinating.
What is the status with that PR? I don't see any comments on it or any reviewers assigned.
We don't really assign reviewers, I bring it up because I noticed it and it is basically the same idea as this PR, from a quick glance - so I don't want to see two separate implementations of the same concept
Some comments in the code probably has to be changed. But let's first agree on the actual interface and functionality. @pitrou @westonpace
OK I think I have passed CI now finally, please comment on the interface @pitrou @westonpace
please comment on the interface
I like the python API. As for the C++:
get_read_rangeandset_boundsare inconsistent. They should beget_xyzandset_xyzalthough...- I think
Fragmentshould be immutable. Instead ofset_boundscan we have aSlicemethod like we have in python that returns a newFragment(admittedly, sinceFragmentis an abstract class, we might need to introduce aClone, but I think that's ok)? - Why do I need to supply a range to
OpenRangeif the object already has a range as an instance variable? For that matter, why do I needOpenRangeat all. Wouldn't the normalOpenautomatically open a range view?
- I think Fragment should be immutable. Instead of set_bounds can we have a Slice method like we have in python that returns a new Fragment (admittedly, since Fragment is an abstract class, we might need to introduce a Clone, but I think that's ok)?
The set_bounds is not a method for Fragment. It is a method for FileFragment which is not an abstract class. Are there other Fragments that are not FileFragments? And does a FileFragment have to be immutable as well?
What I can do is probably make a new MakeFragment method in the abstract FileFormat class which will make a FileFragment with a range. The FileFragment object has a protected attribute format_ which we can use for this purpose. When you call slice on a FileFragment it will just call return format_->MakeFragment(...., range).
- Why do I need to supply a range to OpenRange if the object already has a range as an instance variable? For that matter, why do I need OpenRange at all. Wouldn't the normal Open automatically open a range view?
We have a FileFragment object. In file_csv.cc the name for that object is usually file. The FileFragment object has the read_range. It also has a FileSource. OpenRange is a method of the FileSource.
However the FileFragment object doesn't have an open method. OpenReaderForRangeAsync and OpenReaderAsync are called by the CsvFileFormat, who has access to the FIleFragment. So the CsvFileFormat needs to decide which method to call on the FileSource, by looking at the other field in the FileFragment.
Is this clear? Or am I wrong.
The set_bounds is not a method for Fragment. It is a method for FileFragment which is not an abstract class. Are there other Fragments that are not FileFragments? And does a FileFragment have to be immutable as well?
Yes, my wording was a bit incorrect there. There are non-file fragments such as in-memory fragment and fragments backed by python iterators. However, yes, FileFragment needs to be immutable as well so I think we can continue with this change.
What I can do is probably make a new MakeFragment method in the abstract FileFormat class which will make a FileFragment with a range. The FileFragment object has a protected attribute format_ which we can use for this purpose. When you call slice on a FileFragment it will just call return format_->MakeFragment(...., range).
That seems workable. This would be fine and avoids the need for Clone.
However the FileFragment object doesn't have an open method. OpenReaderForRangeAsync and OpenReaderAsync are called by the CsvFileFormat, who has access to the FIleFragment. So the CsvFileFormat needs to decide which method to call on the FileSource, by looking at the other field in the FileFragment.
I missed entirely that OpenRange was part of FileSource and not FileFragment. You are correct, the API you have for OpenRange is fine.
This has been addressed