arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17313: [C++][Python] Add fragment.slice() for CSV file format

Open marsupialtail opened this issue 3 years ago • 5 comments

This is going to be a draft PR based on discussions in https://issues.apache.org/jira/browse/ARROW-17313. This PR addresses the CSV format. The Parquet format I believe is covered by already implemented API "subset".

The main theme is we want to be able to specify what portion of a FileFragment we want to read when user does .to_batches() in Python or ScanBatchesAsync() in C++.

The implementation here adds two fields start_byte_ and end_byte_ to the FileFragment class which can get set by using .slice() in Python or calling the setters in C++.

I make sure that the slice() and set_bounds() methods in the Python ParquetFileFormat and C++ respectively do the right thing due to inheritance.

This is meant to be a draft. Happy to update the interface if something else makes more sense.

marsupialtail avatar Aug 10 '22 01:08 marsupialtail

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Aug 10 '22 01:08 github-actions[bot]

https://issues.apache.org/jira/browse/ARROW-17313

github-actions[bot] avatar Aug 10 '22 01:08 github-actions[bot]

https://github.com/apache/arrow/pull/13688 is a very similar PR for Parquet that takes a different implementation approach, I think it's worth coordinating.

lidavidm avatar Aug 10 '22 12:08 lidavidm

What is the status with that PR? I don't see any comments on it or any reviewers assigned.

marsupialtail avatar Aug 10 '22 16:08 marsupialtail

We don't really assign reviewers, I bring it up because I noticed it and it is basically the same idea as this PR, from a quick glance - so I don't want to see two separate implementations of the same concept

lidavidm avatar Aug 10 '22 16:08 lidavidm

Some comments in the code probably has to be changed. But let's first agree on the actual interface and functionality. @pitrou @westonpace

marsupialtail avatar Aug 26 '22 17:08 marsupialtail

OK I think I have passed CI now finally, please comment on the interface @pitrou @westonpace

marsupialtail avatar Aug 27 '22 00:08 marsupialtail

please comment on the interface

I like the python API. As for the C++:

  • get_read_range and set_bounds are inconsistent. They should be get_xyz and set_xyz although...
  • I think Fragment should be immutable. Instead of set_bounds can we have a Slice method like we have in python that returns a new Fragment (admittedly, since Fragment is an abstract class, we might need to introduce a Clone, but I think that's ok)?
  • Why do I need to supply a range to OpenRange if the object already has a range as an instance variable? For that matter, why do I need OpenRange at all. Wouldn't the normal Open automatically open a range view?

westonpace avatar Aug 29 '22 20:08 westonpace

  • I think Fragment should be immutable. Instead of set_bounds can we have a Slice method like we have in python that returns a new Fragment (admittedly, since Fragment is an abstract class, we might need to introduce a Clone, but I think that's ok)?

The set_bounds is not a method for Fragment. It is a method for FileFragment which is not an abstract class. Are there other Fragments that are not FileFragments? And does a FileFragment have to be immutable as well?

What I can do is probably make a new MakeFragment method in the abstract FileFormat class which will make a FileFragment with a range. The FileFragment object has a protected attribute format_ which we can use for this purpose. When you call slice on a FileFragment it will just call return format_->MakeFragment(...., range).

  • Why do I need to supply a range to OpenRange if the object already has a range as an instance variable? For that matter, why do I need OpenRange at all. Wouldn't the normal Open automatically open a range view?

We have a FileFragment object. In file_csv.cc the name for that object is usually file. The FileFragment object has the read_range. It also has a FileSource. OpenRange is a method of the FileSource.

However the FileFragment object doesn't have an open method. OpenReaderForRangeAsync and OpenReaderAsync are called by the CsvFileFormat, who has access to the FIleFragment. So the CsvFileFormat needs to decide which method to call on the FileSource, by looking at the other field in the FileFragment.

Is this clear? Or am I wrong.

marsupialtail avatar Aug 30 '22 02:08 marsupialtail

The set_bounds is not a method for Fragment. It is a method for FileFragment which is not an abstract class. Are there other Fragments that are not FileFragments? And does a FileFragment have to be immutable as well?

Yes, my wording was a bit incorrect there. There are non-file fragments such as in-memory fragment and fragments backed by python iterators. However, yes, FileFragment needs to be immutable as well so I think we can continue with this change.

What I can do is probably make a new MakeFragment method in the abstract FileFormat class which will make a FileFragment with a range. The FileFragment object has a protected attribute format_ which we can use for this purpose. When you call slice on a FileFragment it will just call return format_->MakeFragment(...., range).

That seems workable. This would be fine and avoids the need for Clone.

However the FileFragment object doesn't have an open method. OpenReaderForRangeAsync and OpenReaderAsync are called by the CsvFileFormat, who has access to the FIleFragment. So the CsvFileFormat needs to decide which method to call on the FileSource, by looking at the other field in the FileFragment.

I missed entirely that OpenRange was part of FileSource and not FileFragment. You are correct, the API you have for OpenRange is fine.

westonpace avatar Aug 30 '22 13:08 westonpace

This has been addressed

marsupialtail avatar Aug 30 '22 19:08 marsupialtail