Apply filter transform on SFT data and convert to jsonl
Why are these changes needed?
Notebook used for filtering olympiads records and converting them to jsonl
Related issue number (if any).
@touma-I I cleaned up the notebook a little and tested it again after the clean-up. I am ready to approve if you want to move it out of Draft.
@touma-I I don't see any data added to this PR, or any test data being used from elsewhere. How can this be reproduced?
@swith005: I tested this using an internal version of the NuminaMath data, which I added to the "INPUT_FOLDER". The internal version is a subset of an external dataset that is available on HF here: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT/tree/main/data. Looking at the original dataset, though, the schema is different, and our internal team has done some "massaging" of that data. In the HF data, the column that the filter will be applied to is source (instead of _meta_json).
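To illustrate the column difference, here is a minimal plain-Python sketch of filtering on source and writing jsonl. It is not the notebook's code, which uses the filter transform from the release. The helper name filter_to_jsonl, the sample rows, and the value "olympiads" for the source column are assumptions made for illustration. For the internal data the column would be _meta_json, whose layout is not shown here.

```python
import json
import tempfile
from pathlib import Path


def filter_to_jsonl(records, out_path, column="source", value="olympiads"):
    """Write records whose `column` equals `value` to `out_path`, one JSON object per line.

    Hypothetical helper for illustration only; returns the number of records kept.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            if rec.get(column) == value:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Made-up rows; only the "source" column name comes from the discussion above.
sample = [
    {"source": "olympiads", "problem": "p1", "solution": "s1"},
    {"source": "gsm8k", "problem": "p2", "solution": "s2"},
    {"source": "olympiads", "problem": "p3", "solution": "s3"},
]

out_file = Path(tempfile.mkdtemp()) / "olympiads.jsonl"
n_kept = filter_to_jsonl(sample, out_file)
print(f"kept {n_kept} of {len(sample)} records")
```

With real data, the rows would come from the downloaded HF file rather than an in-memory list.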
Thanks for clarifying, @shahrokhDaijavad. Can you update the notebook to at least start with this external dataset from HF, or at least provide a note on where to retrieve it for running?
@swith005 I updated the notebook and tested it with a dataset that I downloaded from HF. I have added a comment with the HF link explaining how to get that file. Since this notebook uses the filter from release 1.1.5, Maroun is suggesting NOT to merge this PR until after 1.1.6; at that time, we will update the pip install to 1.1.6.