
How to Calculate KL Reduction?

Open GenerallyCovetous opened this issue 3 months ago • 0 comments

Can DSIR compute the KL reduction data metric described in the paper? Also, what data preprocessing is required when resampling a custom dataset? My use case is importance resampling of Alpaca-style data, and my current processing code is as follows:

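from data_selection import HashedNgramDSIR

# Raw pool to resample from and the target dataset to match (both JSONL)
raw_datasets = ["/dsir/original_data/train_30k.jsonl"]
target_datasets = ["/dsir/original_data/target.jsonl"]

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/dsir/dsir_cache')
# Fit the importance estimator on the raw and target data
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
# Compute an importance weight for each raw example
dsir.compute_importance_weights()
# Resample 10,000 examples from the raw pool according to those weights
dsir.resample(out_dir='resampled', num_to_sample=10000, cache_dir='/dsir/resampled_cache')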
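
For reference, here is a minimal standalone sketch of how KL reduction could be estimated outside the library. It assumes the metric is KL(target ‖ raw) minus KL(target ‖ selected) over hashed n-gram feature distributions for a single target, which is my reading of the paper and not a confirmed DSIR API. The helper names, the bucket count, the whitespace tokenizer, the add-one smoothing, and the Alpaca field names (`instruction`, `input`, `output`) are all assumptions made for illustration.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 10_000  # assumed feature dimension, not the library's setting


def alpaca_to_text(example):
    """Join Alpaca-style fields into one string (field names are assumed)."""
    fields = ("instruction", "input", "output")
    return "\n".join(example[f] for f in fields if example.get(f))


def hashed_ngram_counts(texts, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams, hashed into a fixed number of buckets."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # simple whitespace tokenizer (assumption)
        ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        for ngram in ngrams:
            counts[zlib.crc32(ngram.encode("utf-8")) % num_buckets] += 1
    return counts


def smoothed_distribution(counts, num_buckets=NUM_BUCKETS, alpha=1.0):
    """Turn bucket counts into probabilities with additive smoothing."""
    total = sum(counts.values()) + alpha * num_buckets
    return [(counts.get(b, 0) + alpha) / total for b in range(num_buckets)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same buckets."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_reduction(target_texts, raw_texts, selected_texts, num_buckets=NUM_BUCKETS):
    """KL(target || raw) - KL(target || selected); larger means a closer match."""
    def dist(texts):
        return smoothed_distribution(hashed_ngram_counts(texts, num_buckets), num_buckets)

    target, raw, selected = dist(target_texts), dist(raw_texts), dist(selected_texts)
    return kl_divergence(target, raw) - kl_divergence(target, selected)
```

To use it, each JSONL record would first be flattened with `alpaca_to_text`, then `kl_reduction` would be called with the target texts, the full raw pool, and the resampled subset. A positive value suggests the resampled subset is closer to the target than the raw pool is, under these assumed features.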

GenerallyCovetous, Sep 11 '25 11:09