Apply filter transform on SFT data and convert to jsonl

Open touma-I opened this issue 1 month ago • 4 comments

Why are these changes needed?

Notebook for filtering olympiads records and converting them to jsonl.

Related issue number (if any).

touma-I avatar Oct 28 '25 01:10 touma-I

@touma-I I cleaned up the notebook a little bit and tested it again after the clean-up. I am ready to approve if you want to move it from the Draft.

shahrokhDaijavad avatar Oct 28 '25 15:10 shahrokhDaijavad

@touma-I I don't see any data added to this PR, or any use of test data from elsewhere. How can this be reproduced?

swith005 avatar Nov 11 '25 17:11 swith005

@swith005: I tested this using an internal version of the NuminaMath data, which I added to the "INPUT_FOLDER". The internal version is a subset of an external dataset that is available on HF here: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT/tree/main/data. However, looking at the original dataset, I see that its schema is different, and our internal team has done some "massaging" of that data. In the HF data, the column that the filter will be applied to is source (instead of _meta_json).
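
For readers without access to the notebook, the step discussed here (keep rows by the value of the source column, then write jsonl) can be sketched in plain Python. This is only an illustration under stated assumptions, not the data-prep-kit filter transform and not the notebook's code: the function name filter_to_jsonl, the toy rows, and the assumption that olympiads rows carry the value "olympiads" in the source column are all hypothetical.

```python
import json
from pathlib import Path


def filter_to_jsonl(records, out_path, column="source", value="olympiads"):
    """Write records whose `column` equals `value` to a JSONL file.

    Hypothetical helper for illustration only. Returns the number of
    records written.
    """
    kept = 0
    with Path(out_path).open("w", encoding="utf-8") as f:
        for rec in records:
            # Rows that lack the column are skipped rather than raising.
            if rec.get(column) == value:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Toy rows standing in for the real data (assumed shape, not real records).
sample = [
    {"source": "olympiads", "problem": "p1", "solution": "s1"},
    {"source": "gsm8k", "problem": "p2", "solution": "s2"},
    {"source": "olympiads", "problem": "p3", "solution": "s3"},
]
```

Calling filter_to_jsonl(sample, "olympiads.jsonl") would write the two matching rows, one JSON object per line. For the internal schema mentioned above, the column argument would need to point at _meta_json instead, and the match would likely need to look inside that field rather than compare it directly.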

shahrokhDaijavad avatar Nov 11 '25 17:11 shahrokhDaijavad

Thanks for clarifying, @shahrokhDaijavad. Can you update the notebook to at least start with this external dataset from HF, or at least provide a note on where to retrieve it for running?

@swith005 I updated the notebook and tested it with a dataset that I downloaded from HF. I have added a comment with the HF link explaining how to get that file. Since this notebook uses the filter from release 1.1.5, Maroun is suggesting that we NOT merge this PR until after 1.1.6 is released; at that time, we will update the pip install to 1.1.6.

swith005 avatar Nov 11 '25 18:11 swith005