vortex icon indicating copy to clipboard operation
vortex copied to clipboard

Can the file writer provide an explicit flush interface?

Open jiaqizho opened this issue 2 months ago • 3 comments

Hi, I'm building a storage engine using Vortex as a file format. As the caller, I want to be able to control writer memory.

my scenario is as follows:

  1. In the same thread, opened multiple vortex writers(VortexWriteOptions)
  2. write the arow::Recordbatche to multiple writers
  3. When a memory limit is hit (hard-code setting), I want to flush some of vortex writers without closing them.

So i guess a explicit flush interface is useful.

jiaqizho avatar Oct 13 '25 04:10 jiaqizho

You should be able to use this function to create a push-based Writer. You can then flush your own Write object as you wish.

Does that help?

Note that this doesn't force a flush of any internally buffered arrays in the writer. The default strategy currently buffers up to 2MB per column to improve column locality in the output file. You may want to use a custom strategy to prevent this.

gatesn avatar Oct 13 '25 18:10 gatesn

You should be able to use this function to create a push-based Writer. You can then flush your own Write object as you wish.

Does that help?

Note that this doesn't force a flush of any internally buffered arrays in the writer. The default strategy currently buffers up to 2MB per column to improve column locality in the output file. You may want to use a custom strategy to prevent this. Thanks for replay

my example is:

let chunk1 =
	StructArray::from_fields(&[("numbers", buffer!(1u32; 1024 * 1024 * 16).into_array())])?.into_array();
let chunk2 =
	StructArray::from_fields(&[("numbers", buffer!(2u32; 1024 * 1024 * 16).into_array())])?.into_array();
let chunk3 =
    StructArray::from_fields(&[("numbers", buffer!(3u32; 1024 * 1024 * 16).into_array())])?.into_array();

let dtype = chunk1.dtype().clone();

let mut buf = ByteBufferMut::empty();
{
	let mut writer = VortexWriteOptions::default()
            .blocking::<CurrentThreadRuntime>()
            .writer(&mut buf, dtype.clone());

    writer.push(chunk1)?;
    // <------i want call `write.flush()?` here
    writer.push(chunk2)?;
    writer.push(chunk3)?;

    let _ = writer.finish()?;
}

You can then flush your own Write object as you wish. Question is: the Write will be the inner object of writer, how can i call the flush in Write?

jiaqizho avatar Oct 14 '25 02:10 jiaqizho

That's a very good point... It's not possible with our current logic to grab hold of the writer in the middle to flush it.

We may need to change the writer to be push-based by default, and spawn a task to push a stream of arrays into it. Instead of defaulting to a stream-based and trying to wrap that up in a push-based API.

I wonder if @onursatici has any thoughts?

gatesn avatar Oct 15 '25 03:10 gatesn