xet-core
refactor data processing
Refactor data processing to:
- change the clean API from an async iterator to a buffer-based API (async can be dropped next, as we drop async in the underlying crates). Usage:
let pft = PointerFileTranslatorV3::new(config).await;
/* ----------- Clean file 1 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path1)).await?;
while let Some(data) = read_file(&mut reader1) {
cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;
/* ----------- Clean file 2 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path2)).await?;
while let Some(data) = read_file(&mut reader2) {
cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;
/* ----------- Finish ----------- */
pft.finalize_cleaning().await
For example, see https://github.com/xetdata/xet-core/blob/2c945292caaa3f57d2742295f4604f6c417c8d6b/rust/gitxetcore/src/data/data_processing_v3.rs#L521
- drop the XetConfig dependency. Right now there are some helper functions that map XetConfig to the new configurations (see https://github.com/xetdata/xet-core/blob/a36954d06bba0cf49838f11c5ab500e894b5177f/rust/gitxetcore/src/data/configurations.rs#L124); these exist only to test the correctness of the new data processing logic using the existing test setup.
- make repo salt optional for dedup
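To illustrate the shape of a push-style, buffer-based clean API (the caller feeds bytes in, rather than the cleaner pulling from an async iterator), here is a minimal synchronous sketch using only the standard library. Everything in it is hypothetical: `ToyCleaner`, `CleanSummary`, and `clean_reader` are not xet-core types, and treating `buffer_size` as a fixed chunk length is an assumption made purely for illustration, not the meaning of the buffer size argument in `start_clean`.

```rust
use std::io::Read;

/// Hypothetical summary returned when a file has been fully fed in.
#[derive(Debug, PartialEq)]
pub struct CleanSummary {
    pub total_bytes: usize,
    pub chunk_lens: Vec<usize>,
}

/// Hypothetical stand-in for a per-file cleaner (NOT the real xet-core type).
/// It buffers incoming bytes and cuts a "chunk" every `buffer_size` bytes.
pub struct ToyCleaner {
    buffer_size: usize,
    buffer: Vec<u8>,
    chunk_lens: Vec<usize>,
    total_bytes: usize,
}

impl ToyCleaner {
    pub fn new(buffer_size: usize) -> Self {
        assert!(buffer_size > 0, "buffer_size must be non-zero");
        ToyCleaner {
            buffer_size,
            buffer: Vec::new(),
            chunk_lens: Vec::new(),
            total_bytes: 0,
        }
    }

    /// The caller pushes data in; nothing is pulled from an iterator.
    pub fn add_bytes(&mut self, data: &[u8]) {
        self.total_bytes += data.len();
        self.buffer.extend_from_slice(data);
        // Cut full chunks while enough bytes are buffered.
        while self.buffer.len() >= self.buffer_size {
            // `split_off` keeps the first `buffer_size` bytes in `self.buffer`
            // and returns the tail.
            let rest = self.buffer.split_off(self.buffer_size);
            self.chunk_lens.push(self.buffer.len());
            self.buffer = rest;
        }
    }

    /// Flushes any remaining bytes as a final short chunk and consumes the cleaner.
    pub fn result(mut self) -> CleanSummary {
        if !self.buffer.is_empty() {
            self.chunk_lens.push(self.buffer.len());
        }
        CleanSummary {
            total_bytes: self.total_bytes,
            chunk_lens: self.chunk_lens,
        }
    }
}

/// Reads from any `Read` source in small pieces and feeds them to a cleaner,
/// mirroring the `while let Some(data) = read_file(..)` loop in the usage above.
pub fn clean_reader<R: Read>(mut reader: R, buffer_size: usize) -> std::io::Result<CleanSummary> {
    let mut cleaner = ToyCleaner::new(buffer_size);
    let mut scratch = [0u8; 1024];
    loop {
        let n = reader.read(&mut scratch)?;
        if n == 0 {
            break; // end of input
        }
        cleaner.add_bytes(&scratch[..n]);
    }
    Ok(cleaner.result())
}
```

Because each cleaner owns its own buffer and state, one cleaner per file can be driven independently, which is the property the usage above relies on when it notes that each clean can be spawned into another thread. The API in this PR is still async and fallible; the sketch leaves both out to stay self-contained.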
All integration tests pass. Same clean speed as before.