VILA math dataset incomplete description


Open hubenjm opened this issue 1 year ago • 2 comments

In https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/datasets_mixture.py#L171C5-L171C6 the math dataset is described as type 'vflan'. However, data_prepare/README.md doesn't make clear which dataset that corresponds to. I'm guessing it is GSM8K-ScRel-SFT, but the annotation file https://github.com/OFA-Sys/gsm8k-ScRel/blob/main/data/train_use.jsonl does not work directly with the LazyVFlanDataset class (https://github.com/Efficient-Large-Model/VILA/blob/d7d54bc4ca1e582f59516ba2f94a0217ad2430a0/llava/data/dataset.py#L1313), which expects multiple .pkl files to live inside the data_path directory. Could you elaborate on how you converted the original train_use.jsonl file into .pkl files, or whether some other approach was used?

May 13 '24 23:05 hubenjm

Hi, thanks for using VILA. If you follow the gsm8k-ScRel link in data_prepare/README.md, it will take you to the annotation file. Each instance in this file contains one "query" field and one "response" field. We simply convert each instance into the following format: {'id': 0, 'question': 'Q:Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\nA:', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'image': []}
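For readers who want to reproduce this, here is a minimal sketch of the conversion described above. The field mapping ("query" to a 'Q:...\nA:' question, "response" to 'answer', an empty 'image' list) follows the example in this comment. Everything else is an assumption and not confirmed in this thread: the function names `convert_record` and `convert_jsonl_to_pkl`, the `shard_size` parameter, the `part-00000.pkl` file naming, and the idea that each .pkl file holds a plain list of such dicts. Check the LazyVFlanDataset source for the layout it actually expects before relying on this.

```python
import json
import pickle
from pathlib import Path


def convert_record(idx, record):
    """Map one {"query", "response"} record to the format shown in the comment above."""
    return {
        "id": idx,
        # The example wraps the query as 'Q:<query>\nA:'.
        "question": f"Q:{record['query']}\nA:",
        "answer": record["response"],
        # Text-only samples carry an empty image list.
        "image": [],
    }


def convert_jsonl_to_pkl(jsonl_path, out_dir, shard_size=10000):
    """Write converted samples as pickled lists, one list per shard.

    ASSUMPTION: the shard layout (a list of dicts per .pkl file) and the
    file names are hypothetical; the thread does not specify them.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read one JSON object per non-empty line.
    with open(jsonl_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    samples = [convert_record(i, r) for i, r in enumerate(records)]

    paths = []
    for start in range(0, len(samples), shard_size):
        path = out_dir / f"part-{start // shard_size:05d}.pkl"
        with open(path, "wb") as f:
            pickle.dump(samples[start:start + shard_size], f)
        paths.append(path)
    return paths
```

A call such as `convert_jsonl_to_pkl("train_use.jsonl", "math_pkl")` would then produce a directory that could be tried as `data_path`, subject to the assumptions above.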

May 15 '24 06:05 Seerkfang

Thanks for the clarification. It would be helpful if you could add these details to the README.md file in a future commit.

Jun 07 '24 05:06 hubenjm
