Implement the equality delete writer

Open ZENOTME opened this issue 1 year ago • 5 comments

After we finish https://github.com/apache/iceberg-rust/pull/275, we can implement the equality delete writer based on this framework.

There is a Rust implementation in icelake that can serve as a reference, but a better design is welcome.

related spec: https://iceberg.apache.org/spec/#equality-delete-files

ZENOTME avatar Apr 23 '24 02:04 ZENOTME

Hi @ZENOTME, maybe I can take this issue after you complete https://github.com/apache/iceberg-rust/issues/345

Dysprosium0626 avatar Apr 24 '24 07:04 Dysprosium0626

Sure! Thanks!

ZENOTME avatar Apr 24 '24 07:04 ZENOTME

Assigned to you, thanks @Dysprosium0626 !

liurenjie1024 avatar Apr 25 '24 10:04 liurenjie1024

Hi, I have nearly finished adding the EqualityDeleteWriter, but I have run into a problem. My implementation is here: https://github.com/Dysprosium0626/iceberg-rust/blob/add_equality_delete_writer/crates/iceberg/src/writer/base_writer/equality_delete_writer.rs

In my test case, I define a schema, use it to build a ParquetWriterBuilder, and pass that builder into EqualityDeleteFileWriterBuilder.

        // prepare writer
        let pb = ParquetWriterBuilder::new(
            WriterProperties::builder().build(),
            to_write.schema(),
            file_io.clone(),
            location_gen,
            file_name_gen,
        );
        let equality_ids = vec![1, 3];
        let mut equality_delete_writer = EqualityDeleteFileWriterBuilder::new(pb)
            .build(EqualityDeleteWriterConfig::new(
                equality_ids,
                schema.clone(),
                PARQUET_FIELD_ID_META_KEY,
            ))
            .await?;

The FieldProjector filters the columns in the schema by equality_ids, and I tried to build a delete_schema from the fields that remain after projection.

    async fn build(self, config: Self::C) -> Result<Self::R> {
        let (projector, fields) = FieldProjector::new(
            config.schema.fields(),
            &config.equality_ids,
            &config.column_id_meta_key,
        )?;
        let delete_schema = Arc::new(arrow_schema::Schema::new(fields));
        Ok(EqualityDeleteFileWriter {
            inner_writer: Some(self.inner.clone().build().await?),
            projector,
            delete_schema,
            equality_ids: config.equality_ids,
        })
    }

The problem is that I cannot pass the delete_schema to the FileWriterBuilder (ParquetWriterBuilder in this case). The inner writer still holds the original schema (without projection), so it cannot write the file properly. Do you have any ideas? @ZENOTME

Dysprosium0626 avatar May 04 '24 14:05 Dysprosium0626

Thanks, @Dysprosium0626! Sorry for the late reply. Our original idea here was to construct the delete schema outside the EqualityDeleteFileWriter.

        let equality_ids = vec![1, 3];
        let delete_schema = ...;
        let pb = ParquetWriterBuilder::new(
            WriterProperties::builder().build(),
            delete_schema,
            file_io.clone(),
            location_gen,
            file_name_gen,
        );
        let mut equality_delete_writer = EqualityDeleteFileWriterBuilder::new(pb)
            .build(EqualityDeleteWriterConfig::new(
                equality_ids,
                PARQUET_FIELD_ID_META_KEY,
            ))
            .await?;

It looks like the schema can always be determined before we build the writer, rather than at run time.

ZENOTME avatar May 06 '24 10:05 ZENOTME
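
For readers following along, here is a rough, std-only sketch of what computing the delete schema up front (the elided `let delete_schema = ...;` step) might look like. Everything in it is an assumption made for illustration: `Field`, `project_by_ids`, and the `META_KEY` value are hypothetical stand-ins, not the iceberg-rust or arrow API. A real version would operate on Arrow fields and read each field ID from the metadata key passed to the writer config.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow field: a name plus string metadata.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    /// Build a field whose ID is stored as a string under `meta_key`.
    fn new(name: &str, meta_key: &str, id: i32) -> Self {
        let mut metadata = HashMap::new();
        metadata.insert(meta_key.to_string(), id.to_string());
        Field { name: name.to_string(), metadata }
    }
}

/// Hypothetical helper: pick the fields whose ID appears in `equality_ids`,
/// in the order the IDs are listed. Fails if an ID is not in the schema.
fn project_by_ids(
    fields: &[Field],
    equality_ids: &[i32],
    meta_key: &str,
) -> Result<Vec<Field>, String> {
    equality_ids
        .iter()
        .map(|id| {
            fields
                .iter()
                .find(|f| {
                    f.metadata
                        .get(meta_key)
                        .and_then(|v| v.parse::<i32>().ok())
                        == Some(*id)
                })
                .cloned()
                .ok_or_else(|| format!("field id {id} not found in schema"))
        })
        .collect()
}

fn main() {
    // Placeholder key for this sketch only.
    const META_KEY: &str = "field_id";
    let fields = vec![
        Field::new("id", META_KEY, 1),
        Field::new("name", META_KEY, 2),
        Field::new("region", META_KEY, 3),
    ];
    // The projected fields would become the schema handed to the inner writer builder.
    let delete_schema = project_by_ids(&fields, &[1, 3], META_KEY).expect("projection failed");
    let names: Vec<&str> = delete_schema.iter().map(|f| f.name.as_str()).collect();
    println!("{names:?}");
}
```

Doing the projection before constructing the inner writer builder means the builder only ever sees the projected schema, which is the point made in the reply above.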