gpt-neox

Add FLAN and T0 finetuning data

Open StellaAthena opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe.
FLAN and T0 are two frameworks for finetuning language models on task-structured data. Both papers report significant improvements in LM capabilities after finetuning on their datasets, which may prove useful to us. I also want to run experiments comparing the two methodologies.

Describe the solution you'd like
Process the data in a Megatron-compliant fashion and create downloaders for each dataset.

StellaAthena avatar Dec 31 '21 17:12 StellaAthena

@uSaiPrashanth is working on T0.
@vaibhavs10 is working on FLAN.

StellaAthena avatar Dec 31 '21 17:12 StellaAthena

Update: I am currently grabbing data from P3 and shaping it into a format accepted by NeoX. The plan is to concatenate the input and target of each prompt and save the result in jsonl format. After that, the data will be preprocessed with tools/preprocess_data.py and converted to a Megatron-compatible version.

uSaiPrashanth avatar Jan 05 '22 16:01 uSaiPrashanth
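
The plan in the last comment (concatenate each prompt's input and target, then save as jsonl) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the script used for this issue: the field names `inputs_pretokenized` and `targets_pretokenized`, the single-space separator, the output key `text`, and the helper names `to_jsonl_records` and `write_jsonl` are all assumptions made for the example.

```python
import json
from typing import Dict, Iterable, Iterator, Mapping


def to_jsonl_records(
    examples: Iterable[Mapping[str, str]],
    input_key: str = "inputs_pretokenized",    # assumed field name
    target_key: str = "targets_pretokenized",  # assumed field name
    separator: str = " ",                      # assumed separator
) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ...} record per example (input + separator + target)."""
    for example in examples:
        prompt = example[input_key].strip()
        target = example[target_key].strip()
        # The output key "text" is an assumption about what the
        # preprocessing step expects.
        yield {"text": prompt + separator + target}


def write_jsonl(records: Iterable[Mapping[str, str]], path: str) -> int:
    """Write records to `path`, one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical in-memory examples standing in for the real P3 data.
    sample = [
        {
            "inputs_pretokenized": "Is this review positive? 'Great film.'",
            "targets_pretokenized": "Yes",
        },
    ]
    written = write_jsonl(to_jsonl_records(sample), "p3_sample.jsonl")
    print(f"wrote {written} record(s)")
```

The resulting file would then be the input to tools/preprocess_data.py. Using generators keeps memory use flat, which matters if the source dataset is large.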