
refactor data processing

Open seanses opened this issue 1 year ago • 0 comments

Refactor data processing to:

  1. Change the clean API from an async iterator to a buffer-based API (the async next can be dropped once async is dropped from the underlying crates). Usage:
let pft = PointerFileTranslatorV3::new(config).await;

/* ----------- Clean file 1 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path1)).await?;
while let Some(data) = read_file(&mut reader1) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Clean file 2 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path2)).await?;
while let Some(data) = read_file(&mut reader2) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Finish ----------- */
pft.finalize_cleaning().await

For example, see https://github.com/xetdata/xet-core/blob/2c945292caaa3f57d2742295f4604f6c417c8d6b/rust/gitxetcore/src/data/data_processing_v3.rs#L521
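To make the buffer-based shape concrete, here is a minimal, synchronous, std-only sketch of a cleaner that accepts pushed bytes instead of pulling from an async iterator. It is not the xet-core implementation: `BufferedCleaner`, `CleanSummary`, and the `flush` step are hypothetical names, and `flush` only counts hand-offs where a real cleaner would pass the bytes on to chunking and dedup.

```rust
/// Hypothetical summary returned when a clean finishes (not an xet-core type).
#[derive(Debug, PartialEq)]
pub struct CleanSummary {
    pub total_bytes: usize,
    pub flushes: usize,
}

/// Hypothetical buffer-based cleaner: callers push bytes, no iterator involved.
pub struct BufferedCleaner {
    buffer: Vec<u8>,
    capacity: usize,
    total_bytes: usize,
    flushes: usize,
}

impl BufferedCleaner {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "buffer size must be non-zero");
        Self {
            buffer: Vec::with_capacity(capacity),
            capacity,
            total_bytes: 0,
            flushes: 0,
        }
    }

    /// Append bytes, handing off the buffer each time it fills up.
    pub fn add_bytes(&mut self, mut data: &[u8]) {
        self.total_bytes += data.len();
        while !data.is_empty() {
            let room = self.capacity - self.buffer.len();
            let take = room.min(data.len());
            self.buffer.extend_from_slice(&data[..take]);
            data = &data[take..];
            if self.buffer.len() == self.capacity {
                self.flush();
            }
        }
    }

    /// Stand-in for passing the buffered bytes to chunking and dedup.
    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            self.flushes += 1;
            self.buffer.clear();
        }
    }

    /// Flush whatever remains and consume the cleaner.
    pub fn result(mut self) -> CleanSummary {
        self.flush();
        CleanSummary {
            total_bytes: self.total_bytes,
            flushes: self.flushes,
        }
    }
}
```

Because `result` takes `self` by value, a finished cleaner cannot receive more bytes, which mirrors the start, add, result sequence in the usage above.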

  2. Drop the XetConfig dependency. Right now there are some helper functions that map XetConfig to the new configurations (see https://github.com/xetdata/xet-core/blob/a36954d06bba0cf49838f11c5ab500e894b5177f/rust/gitxetcore/src/data/configurations.rs#L124); these exist only to test the correctness of the new data processing logic using the existing test setup.

  3. Make the repo salt optional for dedup.
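One way an optional repo salt could be modeled is with an `Option` field, so unsalted dedup needs no placeholder value. The sketch below is an assumption for illustration only: `DedupConfig`, `repo_salt`, and `chunk_key` are hypothetical names, the 32-byte salt length is assumed, and std's `DefaultHasher` stands in for whatever hash the real dedup path uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical dedup settings; names and salt length are illustrative.
pub struct DedupConfig {
    /// `None` means chunks are keyed without a repo-specific salt.
    pub repo_salt: Option<[u8; 32]>,
}

/// Derive a lookup key for a chunk, mixing in the salt only when present.
/// `DefaultHasher` is a std stand-in, not the hash xet-core uses.
pub fn chunk_key(config: &DedupConfig, chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    if let Some(salt) = &config.repo_salt {
        salt.hash(&mut hasher);
    }
    chunk.hash(&mut hasher);
    hasher.finish()
}
```

With this shape, callers that have no salt pass `repo_salt: None` and still get stable keys, while salted and unsalted configurations produce different keys for the same chunk.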

All integration tests pass, and clean speed is the same as before.

seanses, Aug 29 '24 20:08