Query regarding FLAN-v2 Sampling Methodology
Hi, I am currently reading your excellent work on Symbol-LLM, and I just wanted to say it's fantastic!
There's one point I would like to clarify: how did you sample the FLAN dataset? I noticed that you mentioned leveraging the FLAN-v2 dataset for general instruction-tuning and obtaining the data directly following Tulu. The paper states:
The general data collection contains ∼ 570K samples, which are sourced from the following three parts: (1) Sampled Flan collection (Longpre et al., 2023) of 150K samples. We obtain the collection directly following Tulu (Wang et al., 2023b).
However, I was unable to find detailed information on the specific sampling methodology in the Tulu paper, and I wanted to ask if you could provide more clarity on how the sampling process was performed.
Hi, we just use the sampled versions from Tulu v1 (100K) and Tulu v2 (50K). The exact datasets can be found on Hugging Face:
https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture
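For anyone who wants to reproduce the subset sizes mentioned above, a minimal sketch is shown below. The `sample_indices` helper and the seed value are illustrative assumptions, not the authors' actual procedure; the commented-out lines show how it could be combined with the HuggingFace `datasets` library (network access required) to draw 100K / 50K examples from the two Tulu mixtures.

```python
import random

def sample_indices(n_total, n_sample, seed=42):
    """Reproducibly pick n_sample distinct indices out of n_total examples.

    Hypothetical helper; seed 42 is an arbitrary choice, not from the paper.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_total), n_sample)

# Sketch of usage with the HuggingFace `datasets` library (requires network):
# from datasets import load_dataset
# tulu_v1 = load_dataset("allenai/tulu-v1-sft-mixture", split="train")
# tulu_v2 = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
# subset_v1 = tulu_v1.select(sample_indices(len(tulu_v1), 100_000))
# subset_v2 = tulu_v2.select(sample_indices(len(tulu_v2), 50_000))
```

Using a fixed seed keeps the subsample deterministic across runs, which matters if the sampled mixture is to be shared or re-created later.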