seastar
seastar copied to clipboard
file: Generalize in/out streams making
In order to make input/output stream for a file there are two helpers -- make_file_(input|output)_stream out there. Both make file-based sink or source, but these sink and source implementations silently assume that the file is a file-on-disk. Strictly speaking, these are compatible with posix_file_impl, but not with generic file. E.g. pipes (if they exist) might break on read-ahead and write-behind.
This patch adds a customization point to file_impl so that custom file implementations could generate specific sink/source for the stream.
One of the next users of it would be S3 file, which supports posix-ish data_source, but will require very specific data_sink with append-only and extra buffering of 5Mb scattered chunks.
Signed-off-by: Pavel Emelyanov [email protected]
pipes aren't suitable as files, because our files are non-streaming. You can't implement dma_read (which is pread(2)) on a pipe.
Can't S3 files handle append-only in dma_write? Gather buffers from dma_write and send it to S3 when a full chunk is reached, or even just hold buffers until we get the right order.
I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.
Can't S3 files handle append-only in dma_write?
That's the biggest problem. Typically output_stream + file_data_sink_impl would push 128-k buffers into S3 file in a form of void* buf, size_t len
pair. Next, S3 uploading only happens in 5M+ chunks, thus s3_file_impl::dma_write() needs to keep this data until at least 39 more of such buffers arrive (or final flush takes place). I don't see how to keep void* buf, size_t size
inside S3 file without copying, but there's a zero-copy way if S3 provides its own implementation of data_sink_impl.
I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.
The same can be achieved by patching scylla/sstables/ code, specifically the places where it makes output-stream for data and index files out of pre-opened file-s.
Can't S3 files handle append-only in dma_write?
That's the biggest problem. Typically output_stream + file_data_sink_impl would push 128-k buffers into S3 file in a form of
void* buf, size_t len
pair. Next, S3 uploading only happens in 5M+ chunks, thus s3_file_impl::dma_write() needs to keep this data until at least 39 more of such buffers arrive (or final flush takes place). I don't see how to keepvoid* buf, size_t size
inside S3 file without copying, but there's a zero-copy way if S3 provides its own implementation of data_sink_impl.
I see.
I don't have anything specific against the patch, just the general desire to keep interfaces narrow and push complexity to the implementation.
The same can be achieved by patching scylla/sstables/ code, specifically the places where it makes output-stream for data and index files out of pre-opened file-s.
Maybe that's what we should do. Trying to force something that isn't a file to be a file will just distort our notion of what a file is.
The same can be achieved by patching scylla/sstables/ code, specifically the places where it makes output-stream for data and index files out of pre-opened file-s.
Maybe that's what we should do. Trying to force something that isn't a file to be a file will just distort our notion of what a file is.
we did that for compressed output stream. make_compressed_file_output_stream() creates an output stream with uncompressed chunked size as the buffer size and relies on output_stream itself on buffering the data until it reaches the target chunk size. then compressed_file_data_sink is guaranteed to operate on chunks of target size, and only the last one can be smaller than that. seems like good enough for s3, right?
Yes, it works perfectly for S3. And this is how it's currently implemented in my branch, I just though that it could be nice generalization for seastar.
Linux does this multiple layers of customization points. Maybe it can make sense for zoned devices too.