iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

feat: add `RollingManifestWriter`

Open felixscherz opened this issue 1 year ago • 4 comments

Hi, this is in regards to #596 and still WIP.

The RollingManifestWriter implementation closely follows the java implementation.

It takes in a generator that produces ManifestWriter objects and rolls over to a new one once either the number of rows appended or the file size in bytes exceeds the target value.

It's not finished as of yet, I am still trying to find a good way to access the current file from the underlying reader. I tried to obtain that information from the ManifestWriter._writer.output_stream object, but that is write-only.

Any pointers on how to access the current file size of the manifest writer would help me a lot:)

felixscherz avatar Apr 23 '24 16:04 felixscherz

I believe in the Java implementation we have a concept of a PositionOutputStream which is used to keep track of bytes written to each file with a position/counter. What we can do here is extend the current OutputFile, and OutputStream implementations to account for this and roll each based on that position.

For instance, the ManifestWriter can use the AvroFileAppender which calls it's length() method to determine the size. But this length method is calling the storedLength() method in PositionOutputStream.

https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java#L83

geruh avatar Apr 23 '24 20:04 geruh

Sounds good, I will have a look at the implementation and make a suggestion. Thank you!

felixscherz avatar Apr 24 '24 05:04 felixscherz

Hi, I finally had some time to continue working on this.

Based on your suggestions @geruh I added a tell method to the OutputStream protocol that returns the number of bytes written to the stream. I then added __len__ to the AvroOutputFile which calls out to either OutputFile or OutputStream to get the number of bytes written, depending on whether the stream is closed or not. Finally I extended ManifestWriter with a __len__ method that calls AvroOutputFile.

I initially tried to extend OutputStream with __len__ until I realized that both FileIO implementations fsspec and pyarrow offer OutputStream implementations that implement the tell method while neither supports __len__.

If we wanted to go with __len__ instead of simply using tell we might have to implement custom FsspecOutputStream and PyarrowOutputStream classes that implement __len__. This might well be the cleaner approach but introduce a bit more abstraction.

What do you think?

felixscherz avatar May 04 '24 12:05 felixscherz

@Fokko Thanks for taking a look! Sorry about the formatting, should be fixed now:)

felixscherz avatar Jul 09 '24 06:07 felixscherz