
file: Generalize in/out streams making

Open xemul opened this issue 2 years ago • 6 comments

In order to make an input/output stream for a file there are two helpers, make_file_(input|output)_stream. Both make a file-based sink or source, but these sink and source implementations silently assume that the file is a file on disk. Strictly speaking, they are compatible with posix_file_impl, but not with a generic file. E.g. pipes (if they existed) might break on read-ahead and write-behind.

This patch adds a customization point to file_impl so that custom file implementations can generate a specific sink/source for the stream.
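To illustrate the shape of such a customization point, here is a minimal standalone sketch. The class names mirror Seastar's (file_impl, data_sink_impl) but the types are simplified stand-ins, not the real API: the actual Seastar interfaces are future-based and take temporary_buffer arguments.

```cpp
// Hypothetical sketch of a sink customization point on file_impl.
// Simplified stand-in types; NOT the real Seastar interfaces.
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for seastar::data_sink_impl.
struct data_sink_impl {
    virtual ~data_sink_impl() = default;
    virtual void put(std::vector<char> buf) = 0;
    virtual void close() = 0;
};

// Default sink: what make_file_output_stream() effectively assumes today,
// a posix-style file supporting positional writes.
struct posix_file_sink : data_sink_impl {
    size_t written = 0;
    void put(std::vector<char> buf) override { written += buf.size(); }
    void close() override {}
};

// Stand-in for seastar::file_impl with the proposed hook.
struct file_impl {
    virtual ~file_impl() = default;
    // Customization point: implementations may override to return a sink
    // tailored to their backend instead of the posix default.
    virtual std::unique_ptr<data_sink_impl> make_data_sink() {
        return std::make_unique<posix_file_sink>();
    }
};

// A custom implementation (e.g. an S3-backed file) overrides the hook
// rather than forcing its semantics through dma_write().
struct s3_like_file_impl : file_impl {
    struct s3_sink : data_sink_impl {
        size_t buffered = 0;
        void put(std::vector<char> buf) override { buffered += buf.size(); }
        void close() override {}
    };
    std::unique_ptr<data_sink_impl> make_data_sink() override {
        return std::make_unique<s3_sink>();
    }
};
```

make_file_output_stream() would then ask the file's impl for its sink instead of unconditionally constructing the posix one.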

One of the next users of it would be the S3 file, which supports a posix-ish data_source, but will require a very specific data_sink with append-only semantics and extra buffering of 5 MB scattered chunks.

Signed-off-by: Pavel Emelyanov [email protected]

xemul avatar Jan 11 '23 09:01 xemul

pipes aren't suitable as files, because our files are non-streaming. You can't implement dma_read (which is pread(2)) on a pipe.

Can't S3 files handle append-only in dma_write? Gather buffers from dma_write and send them to S3 when a full chunk is reached, or even just hold buffers until we get the right order.
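The suggestion above can be sketched roughly as follows. All names here are hypothetical (s3_append_writer is not a real Seastar or Scylla class), and note the copy into the staging buffer, which is exactly what the follow-up comment objects to.

```cpp
// Hedged sketch: an S3-backed append-only writer that gathers incoming
// dma_write() buffers into a staging area and "uploads" once a full
// chunk accumulates. Hypothetical names; synchronous for brevity.
#include <cstddef>
#include <vector>

class s3_append_writer {
    static constexpr size_t chunk_size = 5 * 1024 * 1024; // S3 minimum part size
    std::vector<char> staging_;
public:
    size_t uploaded_parts = 0;
    // Mimics dma_write(pos, buf, len) for strictly appending writes.
    void write(const char* buf, size_t len) {
        staging_.insert(staging_.end(), buf, buf + len); // the extra copy
        while (staging_.size() >= chunk_size) {
            upload_part();
        }
    }
    // Final flush: upload whatever remains as the (smaller) last part.
    void flush() {
        if (!staging_.empty()) {
            staging_.clear(); // would issue the final UploadPart request
            ++uploaded_parts;
        }
    }
private:
    void upload_part() {
        // would issue an UploadPart request for staging_[0..chunk_size)
        staging_.erase(staging_.begin(), staging_.begin() + chunk_size);
        ++uploaded_parts;
    }
};
```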

I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.

avikivity avatar Jan 11 '23 10:01 avikivity

> Can't S3 files handle append-only in dma_write?

That's the biggest problem. Typically, output_stream + file_data_sink_impl would push 128 KiB buffers into the S3 file in the form of a void* buf, size_t len pair. But S3 uploading only happens in 5 MB+ chunks, so s3_file_impl::dma_write() would need to hold this data until at least 39 more such buffers arrive (or the final flush takes place). I don't see how to keep a void* buf, size_t len inside the S3 file without copying, but there's a zero-copy way if S3 provides its own implementation of data_sink_impl.
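The zero-copy alternative can be sketched like this: a data_sink_impl-style sink takes ownership of each incoming buffer (moved in, never copied) until 40 × 128 KiB = 5 MiB has accumulated, then hands the whole scattered chunk to the uploader. Simplified standalone types, assumed for illustration; real Seastar would move temporary_buffer<char> objects and return futures.

```cpp
// Sketch of a zero-copy S3 sink: buffers are owned, not copied.
// Hypothetical names; simplified stand-ins for Seastar types.
#include <cstddef>
#include <utility>
#include <vector>

class s3_data_sink {
    static constexpr size_t chunk_size = 5 * 1024 * 1024;
    std::vector<std::vector<char>> chunks_; // owned buffers, no memcpy
    size_t bytes_ = 0;
public:
    size_t uploads = 0;
    void put(std::vector<char> buf) {       // buffer moved in, not copied
        bytes_ += buf.size();
        chunks_.push_back(std::move(buf));
        if (bytes_ >= chunk_size) {
            upload();
        }
    }
    void close() {
        if (bytes_ > 0) {
            upload();                       // last part may be smaller
        }
    }
private:
    void upload() {
        // would send chunks_ as one scattered multipart-upload part
        chunks_.clear();
        bytes_ = 0;
        ++uploads;
    }
};
```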

> I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.

The same can be achieved by patching the scylla/sstables/ code, specifically the places where it makes output streams for data and index files out of pre-opened file-s.

xemul avatar Jan 11 '23 11:01 xemul

> Can't S3 files handle append-only in dma_write?
>
> That's the biggest problem. Typically, output_stream + file_data_sink_impl would push 128 KiB buffers into the S3 file in the form of a void* buf, size_t len pair. But S3 uploading only happens in 5 MB+ chunks, so s3_file_impl::dma_write() would need to hold this data until at least 39 more such buffers arrive (or the final flush takes place). I don't see how to keep a void* buf, size_t len inside the S3 file without copying, but there's a zero-copy way if S3 provides its own implementation of data_sink_impl.

I see.

> I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.

> The same can be achieved by patching the scylla/sstables/ code, specifically the places where it makes output streams for data and index files out of pre-opened file-s.

Maybe that's what we should do. Trying to force something that isn't a file to be a file will just distort our notion of what a file is.

avikivity avatar Jan 11 '23 11:01 avikivity

> The same can be achieved by patching the scylla/sstables/ code, specifically the places where it makes output streams for data and index files out of pre-opened file-s.

> Maybe that's what we should do. Trying to force something that isn't a file to be a file will just distort our notion of what a file is.

We did that for the compressed output stream. make_compressed_file_output_stream() creates an output stream with the uncompressed chunk size as the buffer size and relies on the output_stream itself to buffer the data until it reaches the target chunk size. Then compressed_file_data_sink is guaranteed to operate on chunks of the target size, and only the last one can be smaller than that. Seems good enough for S3, right?

raphaelsc avatar Jan 11 '23 13:01 raphaelsc

Yes, it works perfectly for S3, and this is how it's currently implemented in my branch. I just thought it could be a nice generalization for Seastar.

xemul avatar Jan 11 '23 14:01 xemul

Linux does this with multiple layers of customization points. Maybe it can make sense for zoned devices too.

avikivity avatar Jan 12 '23 10:01 avikivity