
feat: add hint to support parquet with nanosecond timestamps

Open hntd187 opened this issue 2 years ago • 3 comments

Description

This adds a hint during schema conversion that tells convert-to-delta the timestamps are in nanoseconds and need to be converted to microseconds, at which point the written schema should use the standard SchemaDataType::primitive("timestamp") style.
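For illustration, the value conversion the hint calls for could look like the sketch below. The helper name `nanos_to_micros` is hypothetical and not part of delta-rs, and rounding toward negative infinity for pre-epoch values is an assumption of this sketch, not necessarily what the writer or Arrow's cast does.

```rust
/// Hypothetical helper (not a delta-rs API): convert a timestamp expressed in
/// nanoseconds since the epoch to microseconds since the epoch.
///
/// Euclidean division is an assumption of this sketch: it rounds pre-epoch
/// (negative) values toward negative infinity rather than toward zero.
pub fn nanos_to_micros(nanos: i64) -> i64 {
    nanos.div_euclid(1_000)
}
```

Sub-microsecond precision is discarded, so the conversion is lossy and cannot be reversed when the table is read back.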

Related Issue(s)

For #1721 @junjunjd

Documentation

hntd187 avatar Oct 28 '23 14:10 hntd187

Thanks for the pull request, I'm still mulling this over and kind of hoping somebody else chimes in with an opinion on the approach :thinking:

rtyler avatar Oct 30 '23 18:10 rtyler

My own two cents is that this is a very meh solution, but it's the only obvious way I saw to carry some context to someone converting an Arrow schema as part of a convert_to_delta. It may make more sense for the conversion to Delta not to use the standard schema conversion facilities, and instead to have something more purpose-fit that gives context on what, if anything, has to be converted.

hntd187 avatar Oct 30 '23 20:10 hntd187

What about modifying the conversion function to something like this:

impl TryFrom<&ArrowField> for schema::SchemaField {
    type Error = ArrowError;
    fn try_from(arrow_field: &ArrowField) -> Result<Self, ArrowError> {
        // Copy the Arrow field metadata into the Delta field metadata.
        let mut metadata: HashMap<String, serde_json::Value> = arrow_field
            .metadata()
            .iter()
            .map(|(k, v)| (k.clone(), serde_json::Value::String(v.clone())))
            .collect();
        // Record a hint so later stages know the values are in nanoseconds.
        if let ArrowDataType::Timestamp(TimeUnit::Nanosecond, _) = arrow_field.data_type() {
            metadata.insert("_delta-rs.timestamp.convert".into(), "nano".into());
        }

        Ok(schema::SchemaField::new(
            arrow_field.name().clone(),
            arrow_field.data_type().try_into()?,
            arrow_field.is_nullable(),
            // Pass the augmented map; rebuilding it from the Arrow field here
            // would silently drop the hint inserted above.
            metadata,
        ))
    }
}

The magic string _delta-rs.timestamp.convert could come from a well-known set of metadata keys that we use internally. It could be used for writing and value conversion, or anything else someone can think of, but it should be ignored when read back from the table, since by then it no longer carries any meaning.
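A rough sketch of what such a well-known set might look like, using plain string maps so it stands alone. The constant and function names (`INTERNAL_PREFIX`, `TIMESTAMP_CONVERT_KEY`, `needs_nano_conversion`, `strip_internal_hints`) are made up for illustration and are not existing delta-rs APIs. Treating everything under the `_delta-rs.` prefix as internal is also an assumption.

```rust
use std::collections::HashMap;

/// Assumed prefix reserved for internal hints (hypothetical, derived from the
/// key proposed in this thread).
pub const INTERNAL_PREFIX: &str = "_delta-rs.";

/// The hint key proposed in this thread.
pub const TIMESTAMP_CONVERT_KEY: &str = "_delta-rs.timestamp.convert";

/// Returns true when the field metadata carries the nanosecond hint.
pub fn needs_nano_conversion(metadata: &HashMap<String, String>) -> bool {
    metadata.get(TIMESTAMP_CONVERT_KEY).map(String::as_str) == Some("nano")
}

/// Returns a copy of the metadata without internal hints, e.g. for use when
/// the metadata is read back from the table and the hints mean nothing.
pub fn strip_internal_hints(metadata: &HashMap<String, String>) -> HashMap<String, String> {
    metadata
        .iter()
        .filter(|(key, _)| !key.starts_with(INTERNAL_PREFIX))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}
```

Keeping the keys behind one prefix would let readers drop every internal hint with a single filter instead of listing each key.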

mightyshazam avatar Nov 02 '23 13:11 mightyshazam