delta-rs
delta-rs copied to clipboard
feat: add hint to support parquet with nanosecond timestamps
Description
This gives a hint on the schema convert telling convert to delta that the timestamps are nanosecond and need to be converted to the appropriate microseconds, at which point the schema written should use the standard SchemaDataType::primitive("timestamp") style.
Related Issue(s)
For #1721 @junjunjd
Documentation
Thanks for the pull request, I'm still mulling this over and kind of hoping somebody else chimes in with an opinion on the approach :thinking:
My own two cents is this is a very meh solution. But it's the only obvious way I saw to carry some context to someone converting an arrow schema to do a conver_to_delta It may make more sense for the conversion to delta to not use the standard schema conversion facilities and have something more purpose fit that gives the context on what if anything has to be converted.
What about modifying the conversion function to something like
impl TryFrom<&ArrowField> for schema::SchemaField {
type Error = ArrowError;
fn try_from(arrow_field: &ArrowField) -> Result<Self, ArrowError> {
let mut metadata: HashMap<String, serde_json::Value> = arrow_field
.metadata()
.iter()
.map(|(k, v)| (k.clone(), serde_json::Value::String(v.clone())))
.collect();
match arrow_field.data_type() {
ArrowDataType::Timestamp(TimeUnit::Nanosecond, _) => {
metadata.insert("_delta-rs.timestamp.convert".into(), "nano".into());
}
_ => {}
};
Ok(schema::SchemaField::new(
arrow_field.name().clone(),
arrow_field.data_type().try_into()?,
arrow_field.is_nullable(),
arrow_field
.metadata()
.iter()
.map(|(k, v)| (k.clone(), serde_json::Value::String(v.clone())))
.collect(),
))
}
}
The magic string _delta-rs.timestamp.convert could come from a well known set of metadata that we use internally. It can be used for writing and value conversion, or any other thing someone can think of, but it should be ignored when read back from the table since it will have lost its value.