Support deletion vector

Open wjones127 opened this issue 2 years ago • 12 comments

Description

For reader protocol version 3, we will want to support deletion vectors.

  • [ ] Support reading tables with deletion vectors
  • [ ] Support delete operations using deletion vectors

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors

Question: how do we decide to rewrite vs use delete vector?

Use Case

This enables much faster deletes.

Related Issue(s)

Prerequisites:

  • #930
  • #832

wjones127 avatar Jan 24 '23 02:01 wjones127

Question: how do we decide to rewrite vs use delete vector?

This looks like a tradeoff between faster reads and faster writes that needs to be decided case by case? If so, it might be better to just let the user decide depending on the expected workload pattern.

houqp avatar Jan 25 '23 04:01 houqp

+1 to supporting a user-owned tradeoff decision. I'm investigating this feature internally, and the update patterns in individual tables likely dictate the right decision.

For instance, in many dimension tables, edits may be spread randomly through existing data, and merge on read will be more efficient. For fact tables with a mostly append pattern (but occasional fact updates), judicious partitioning plus copy on write may be superior.
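
To make the tradeoff concrete, here is a rough per-file cost sketch. Everything in it is hypothetical: `DeleteStrategy`, `rows_rewritten` and `rows_discarded_on_read` are illustration names, not delta-rs API, and the model ignores file-level overheads entirely.

```rust
/// Hypothetical user-facing choice; not part of delta-rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeleteStrategy {
    /// Rewrite every touched file without the deleted rows.
    CopyOnWrite,
    /// Keep the file and record deleted row positions in a deletion vector.
    MergeOnRead,
}

/// Data rows a delete has to rewrite for one file (simplified model).
fn rows_rewritten(strategy: DeleteStrategy, file_rows: u64, deleted_rows: u64) -> u64 {
    match strategy {
        // All surviving rows are written out again.
        DeleteStrategy::CopyOnWrite => file_rows.saturating_sub(deleted_rows),
        // Only the deletion vector is written; no data rows are rewritten.
        DeleteStrategy::MergeOnRead => 0,
    }
}

/// Rows a later scan of that file reads and then throws away (simplified model).
fn rows_discarded_on_read(strategy: DeleteStrategy, deleted_rows: u64) -> u64 {
    match strategy {
        DeleteStrategy::CopyOnWrite => 0,
        DeleteStrategy::MergeOnRead => deleted_rows,
    }
}
```

Under this model, deleting 10 rows from a 1,000,000-row file rewrites 999,990 rows with copy on write and none with merge on read, at the price of every later scan discarding those 10 rows. A table with scattered edits pays the copy-on-write cost across many files, which is why the choice probably belongs to the user.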

guyrt avatar Feb 06 '23 06:02 guyrt

Don't know if this helps, just tried to read a deletion vector file, and this seems to be working with the roaring crate:



use std::fs::File;
use std::io::{ErrorKind, Read};

use roaring::RoaringTreemap;

fn get_deletion_vectors(
    filename: &str,
) -> Result<Vec<RoaringTreemap>, Box<dyn std::error::Error + Send + Sync>> {
    let mut file = File::open(filename)?;

    // Per PROTOCOL.md, the file starts with a single format-version byte.
    let mut version = [0u8; 1];
    file.read_exact(&mut version)?;
    assert_eq!(version[0], 1);

    let mut vectors = Vec::new();
    loop {
        // Each vector is prefixed with its data size (4 bytes, big endian).
        let mut size_buf = [0u8; 4];
        match file.read_exact(&mut size_buf) {
            Ok(()) => {}
            // End of file: no more vectors to read.
            Err(e) if e.kind() == ErrorKind::UnexpectedEof => return Ok(vectors),
            Err(e) => return Err(e.into()),
        }
        let datasize = u32::from_be_bytes(size_buf) as usize;
        if datasize < 4 {
            return Err("deletion vector entry is too short to hold the magic number".into());
        }

        // Read the whole entry up front, so the file position stays correct
        // no matter how many bytes roaring-rs consumes while deserializing.
        let mut data = vec![0u8; datasize];
        file.read_exact(&mut data)?;

        // The data is followed by a 4-byte checksum (not verified here).
        let mut checksum = [0u8; 4];
        file.read_exact(&mut checksum)?;

        // The data starts with a little-endian magic number, followed by
        // the serialized 64-bit roaring bitmap.
        let magic = i32::from_le_bytes(data[0..4].try_into()?);
        assert_eq!(magic, 1681511377);
        vectors.push(RoaringTreemap::deserialize_from(&data[4..])?);
    }
}
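
The snippet above reads the 4-byte checksum that follows each vector but never checks it. A minimal, std-only sketch of that check follows. It assumes the checksum is a standard CRC-32 (the IEEE polynomial, as used by zlib) over the bitmap data, including the magic number, and that it is stored big endian like the size field. Both are my reading of PROTOCOL.md and should be confirmed against real files. `crc32` and `checksum_matches` are hypothetical helper names.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// Slow but dependency-free; a table-driven or crate-based version
/// would be preferable in real code.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones when the low bit is set, all zeros otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compares the computed CRC-32 of `bitmap_data` with the stored bytes.
/// Assumption: the stored checksum is big endian.
fn checksum_matches(bitmap_data: &[u8], stored: [u8; 4]) -> bool {
    crc32(bitmap_data) == u32::from_be_bytes(stored)
}
```

In the reader, this would be called with the bytes of one entry and the four bytes read right after it, returning an error instead of a vector when the two disagree.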

aersam avatar Jul 05 '23 06:07 aersam

Would you accept a PR that adds the required metadata as a first step?

aersam avatar Jul 10 '23 14:07 aersam

Hi @aersam - first of all, thanks for the code snippet; it actually saved me a bit of time working on this elsewhere.

In principle we always welcome contributions. In this case we also do, but there is one caveat. Elsewhere we are currently working hard on getting delta-kernel for rust released which will hopefully significantly boost our protocol support.

The more complex thing here is that, in order to support deletion vectors, we have to either support reader v3 and writer v7 (i.e. table features) or support a whole bunch of other delta features as well.

Good news is we are actively working on it, but since this involves some larger blocks of work, it's likely going to be a few weeks before this can fully land...

With all that said, if you would benefit from having some intermediate partial support, I'd be happy to review PRs :)

roeap avatar Jul 10 '23 15:07 roeap

Well, if it's a matter of weeks, I can wait. I know that column mapping would actually come first; I just thought this one could not be that hard ;)

I did not know about delta-kernel for Rust, I'm really glad to hear about it! To be honest, I was a bit disappointed as I thought it would be in Java - nothing against Java, but I much prefer Rust, especially for embedding. Where do I find the code for delta-kernel/rust? Just to observe it a bit.

Btw I also corrected the snippet; it had a bug when there are multiple vectors within a file.

aersam avatar Jul 10 '23 15:07 aersam

@roeap where can one follow the Delta kernel initiatives? I saw https://github.com/delta-io/delta/issues/1783 but that's not rust specific, right? Will it happen in this repo or will there be a delta-kernel-rs?

alippai avatar Jul 10 '23 17:07 alippai

Trying to get the metadata running here: https://github.com/bmsuisse/delta-rs/tree/deletion_vector_meta. Once you have the metadata, you could use it, for example, together with duckdb's read_parquet([parquets...],file_row_number=True) to read tables with deletion vectors.
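
The idea is simply to drop every row whose position in its file appears in that file's deletion vector. A minimal sketch of that filter, using a std `BTreeSet<u64>` as a stand-in for the roaring bitmap and a hypothetical `filter_deleted` helper (neither is delta-rs or duckdb API):

```rust
use std::collections::BTreeSet;

/// Keeps the rows whose file row number is not marked as deleted.
/// `rows` pairs each value with its 0-based position in the Parquet file;
/// `deleted` stands in for the deletion vector of that file.
fn filter_deleted<T>(rows: Vec<(u64, T)>, deleted: &BTreeSet<u64>) -> Vec<T> {
    rows.into_iter()
        .filter(|(row_number, _)| !deleted.contains(row_number))
        .map(|(_, value)| value)
        .collect()
}
```

For example, with rows `[(0, "a"), (1, "b"), (2, "c"), (3, "d")]` and deleted positions `{1, 3}`, the surviving values are `["a", "c"]`. The positions are per file, so each Parquet file has to be matched with its own vector before filtering.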

aersam avatar Aug 01 '23 15:08 aersam

FWIW, Fabric Data Warehouse just added support for deletion vectors, and suddenly the Delta tables it produces are no longer compatible with delta-rs :(

djouallah avatar Nov 02 '23 01:11 djouallah