Extend `ExportCfg` to support full data catalogs
High level description
Kedro has a data catalog concept, which is absolutely fantastic to use. In a way, this is what `ExportCfg` does, but only for saving data, only in Parquet format, and only locally.
The purpose of this ticket is to extend `ExportCfg` so that a single catalog entry can load and save many files under the same timestamp. That makes it easy for the engineer to know when each dataset was generated and which files belong to which run.
Requirements
- When enabled in the config, it should support versioning by saving data in a timestamped folder (MVP; this can later be extended to other versioning methodologies)
- The ExportCfg should be renamed to something more relevant for loading and storing.
- It shall support the local filesystem and the S3 protocol for now, nothing else.
- It shall support credentials for S3, like in Kedro
- It shall be possible to load many different files from a given version, unlike Kedro's catalog.
- It shall support reading any dataframe format that Rust's arrow crate supports (at a minimum Parquet and CSV)
- It shall be a serializable structure, either as YAML or as Dhall
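
To make the versioning requirement concrete, here is a minimal sketch of one possible folder layout, assuming that a version is simply a sub-folder of the catalog root and that every file of a run lives in that folder. The `CatalogVersion` type, its method names, and the use of Unix seconds as the folder label are all hypothetical and only serve as illustration. The sketch uses the standard library only.

```rust
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: one catalog version is one folder under `root`.
#[derive(Debug, Clone, PartialEq)]
pub struct CatalogVersion {
    root: PathBuf,
    label: String,
}

impl CatalogVersion {
    /// Refer to a version by its folder label (for example, to reload a past run).
    pub fn existing(root: impl AsRef<Path>, label: &str) -> Self {
        Self {
            root: root.as_ref().to_path_buf(),
            label: label.to_string(),
        }
    }

    /// Start a new version labelled with the current Unix time in seconds.
    /// Assumption: a real implementation would likely prefer a readable UTC stamp.
    pub fn now(root: impl AsRef<Path>) -> Self {
        let secs = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        Self::existing(root, &secs.to_string())
    }

    /// Folder that holds every file of this version.
    pub fn dir(&self) -> PathBuf {
        self.root.join(&self.label)
    }

    /// Path of one named file inside this version.
    pub fn path_for(&self, file_name: &str) -> PathBuf {
        self.dir().join(file_name)
    }

    /// Paths of several files that all belong to this version.
    pub fn paths_for(&self, file_names: &[&str]) -> Vec<PathBuf> {
        file_names.iter().map(|n| self.path_for(n)).collect()
    }
}
```

With this layout, loading many files from a given version amounts to calling `paths_for` on a `CatalogVersion::existing(...)` value, and saving a new run amounts to writing every file under `CatalogVersion::now(...).dir()`, so all outputs of a run share one timestamp.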
Test plans
- Replace all ExportCfg with this new approach
- Ensure that full scenario data can be reloaded from there.
Design
This should also take inspiration from the MetaFile approach used in ANISE to download data behind URLs. I also wonder whether this should be its own crate!
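```rust
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;

/// Top-level configuration of the data catalog.
#[derive(Serialize, Deserialize, Debug)]
pub struct DataCatalogConfig {
    /// Whether data is saved in a timestamped folder.
    pub versioning: bool,
    pub storage: StorageConfig,
    /// Only needed for S3 storage.
    pub credentials: Option<Credentials>,
    /// Files of this catalog entry, keyed by name. Trait objects cannot derive
    /// `Serialize`/`Deserialize`, so serde skips this field; as a consequence,
    /// the file list is not part of the serialized config in this sketch.
    #[serde(skip)]
    pub files: BTreeMap<String, Option<Box<dyn LoadedFile>>>,
}

/// Where the data lives: a local path, an S3 path, or both.
#[derive(Serialize, Deserialize, Debug)]
pub struct StorageConfig {
    pub local_path: Option<String>,
    pub s3_path: Option<String>,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct Credentials {
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
}

/// `Debug` is a supertrait so that `DataCatalogConfig` can derive `Debug`.
pub trait LoadedFile: std::fmt::Debug {
    fn load(&self) -> Result<Box<dyn LoadedFile>, Box<dyn std::error::Error>>;
}
```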
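
One possible refinement of the storage part of this design, offered only as a sketch: instead of two optional path fields, an enum could guarantee that exactly one backend is selected, and the file format could be inferred from the file extension. The names `StorageLocation` and `FileFormat`, the `s3://bucket/prefix` URI convention, and the extension-based detection are assumptions made for illustration; none of them come from the existing code. The sketch uses the standard library only and does not perform any I/O.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical alternative to two optional path fields: exactly one backend.
#[derive(Debug, Clone, PartialEq)]
pub enum StorageLocation {
    Local(PathBuf),
    S3 { bucket: String, prefix: String },
}

impl StorageLocation {
    /// `s3://bucket/prefix` is parsed as S3, any other `scheme://` is rejected,
    /// and everything else is treated as a local path.
    pub fn parse(uri: &str) -> Result<Self, String> {
        if let Some(rest) = uri.strip_prefix("s3://") {
            // Split "bucket/prefix"; a bare bucket has an empty prefix.
            let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
            if bucket.is_empty() {
                return Err(format!("missing bucket name in `{uri}`"));
            }
            Ok(Self::S3 {
                bucket: bucket.to_string(),
                prefix: prefix.to_string(),
            })
        } else if uri.contains("://") {
            Err(format!("unsupported protocol in `{uri}`"))
        } else {
            Ok(Self::Local(PathBuf::from(uri)))
        }
    }
}

/// The two formats listed as the minimum in the requirements.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FileFormat {
    Parquet,
    Csv,
}

impl FileFormat {
    /// Infer the format from the file extension, ignoring ASCII case.
    pub fn from_path(path: &Path) -> Option<Self> {
        match path.extension()?.to_str()?.to_ascii_lowercase().as_str() {
            "parquet" => Some(Self::Parquet),
            "csv" => Some(Self::Csv),
            _ => None,
        }
    }
}
```

The enum makes the "local and S3, nothing else" requirement explicit in the type: a configuration with both paths set, or with neither, cannot be represented, and an unsupported protocol is rejected at parse time rather than at load time.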