Query regarding FLAN-v2 Sampling Methodology
Hi, I am currently reading your excellent work on Symbol-LLM, and I just wanted to say it's fantastic!
There's one point I would like to clarify: how did you sample the FLAN dataset? I noticed that you mentioned leveraging the FLAN-v2 dataset for general instruction-tuning and obtaining the data directly following Tulu. The paper states:
The general data collection contains ∼ 570K samples, which are sourced from the following three parts: (1) Sampled Flan collection (Longpre et al., 2023) of 150K samples. We obtain the collection directly following Tulu (Wang et al., 2023b).
However, I was unable to find detailed information on the specific sampling methodology in the Tulu paper, and I wanted to ask if you could provide more clarity on how the sampling process was performed.
Hi, we just use the sampled versions from Tulu v1 (100K) and Tulu v2 (50K). The exact datasets can be found on Hugging Face:
https://huggingface.co/datasets/allenai/tulu-v1-sft-mixture
https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture
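For anyone who wants to reproduce the subset sizes mentioned above, a minimal sketch is shown below. The `sample_indices` helper and the seed value are illustrative assumptions, not the authors' actual procedure; the commented-out lines show how it could be combined with the HuggingFace `datasets` library (network access required) to draw 100K / 50K examples from the two Tulu mixtures.

```python
import random

def sample_indices(n_total, n_sample, seed=42):
    """Reproducibly pick n_sample distinct indices out of n_total examples.

    Hypothetical helper; seed 42 is an arbitrary choice, not from the paper.
    """
    rng = random.Random(seed)
    return rng.sample(range(n_total), n_sample)

# Sketch of usage with the HuggingFace `datasets` library (requires network):
# from datasets import load_dataset
# tulu_v1 = load_dataset("allenai/tulu-v1-sft-mixture", split="train")
# tulu_v2 = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
# subset_v1 = tulu_v1.select(sample_indices(len(tulu_v1), 100_000))
# subset_v2 = tulu_v2.select(sample_indices(len(tulu_v2), 50_000))
```

Using a fixed seed keeps the subsample deterministic across runs, which matters if the sampled mixture is to be shared or re-created later.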