
Best way to fine-tune on wiki data

Open JulianBvW opened this issue 2 years ago • 6 comments

I want to fine-tune LLaMA on data I got from a fandom wiki (for example this page) and was wondering how to design the JSON file with its "prompt", "input", and "output" fields.

I can't just use the prompt "Write the next sentence" and then put two adjacent sentences in the input and output, right?

JulianBvW avatar Jun 05 '23 15:06 JulianBvW

One way would be to use the Dolly 2.0 JSON file as a template, structure your dataset in the same fashion using the same keys, and then run the prepare_dolly script.

[Screenshot: example records from the Dolly 2.0 dataset]
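
For concreteness, here is a rough sketch of how you could assemble such a file from your scraped pages. The key names follow the databricks-dolly-15k format, but the scraping format and the summarization instruction are just made-up placeholders, and you would still need to fill in real responses:

```python
import json

# Hypothetical sketch: turn scraped wiki pages into Dolly-style records.
# Key names follow databricks-dolly-15k; double-check them against the file
# that scripts/prepare_dolly.py downloads. The scraping output format below
# ("title"/"text" dicts) is just an assumption for illustration.
scraped_pages = [
    {"title": "Example Hero", "text": "Example Hero is the main protagonist of ..."},
]

records = []
for page in scraped_pages:
    records.append({
        "instruction": f"Summarize the wiki article about {page['title']}.",
        "context": page["text"],   # the scraped article body
        "response": "",            # you still need a target answer for each record
        "category": "summarization",
    })

with open("my_wiki_dolly.json", "w") as f:
    json.dump(records, f, indent=2)
```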

rasbt avatar Jun 05 '23 15:06 rasbt

Or, maybe even easier, you could structure it like the Alpaca dataset, which uses slightly different key names, and then use the prepare_alpaca script.

[Screenshot: example records from the Alpaca dataset]
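
A made-up entry in that format would look roughly like this (keys as in the original Alpaca JSON; the content is invented):

```python
import json

# Hypothetical Alpaca-style records (keys as in the original alpaca_data.json).
# An empty "input" is fine for instructions that need no extra context.
examples = [
    {
        "instruction": "Describe the character Example Hero from the Example Fandom wiki.",
        "input": "",
        "output": "Example Hero is the main protagonist of ...",
    },
]

with open("my_wiki_alpaca.json", "w") as f:
    json.dump(examples, f, indent=2)
```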

rasbt avatar Jun 05 '23 15:06 rasbt

Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

JulianBvW avatar Jun 05 '23 16:06 JulianBvW

I'm also looking for the best way of creating a dataset. I suppose we have to manually create some initial dataset (instructions/outputs) and can then use self-instruct to expand it and use it for training.

I'm not sure how much data we need to create or how long each instruction and response should be.

Is there a more systematic way of creating manual data?

asadabbas09 avatar Jun 06 '23 03:06 asadabbas09

Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

Oh, I think I now understand what you mean. Essentially, you don't have an instruction-finetuning dataset, correct? In other words, it's an "unlabeled" dataset. One way to handle this would be to create an instruction dataset via imitation learning: use another LLM (e.g., GPT-4 via the API) to generate instruction-response pairs from your scraped text. This is essentially how the Alpaca dataset itself was created (for more details: https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
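
Very roughly, the generation loop could look something like the sketch below. This is untested, the prompt and model name are only placeholders, and in practice you would want validation and retries because the model will not always return valid JSON:

```python
import json
import openai  # assumes the pre-1.0 openai package that was current in mid-2023

openai.api_key = "sk-..."  # your API key

SYSTEM_PROMPT = (
    "You will be given a passage from a fandom wiki. Respond with a single JSON "
    'object with the keys "instruction", "input", and "output" describing a task '
    "that can be answered from the passage."
)

def make_example(passage: str) -> dict:
    # Ask the teacher model to turn a raw passage into an Alpaca-style record.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": passage},
        ],
        temperature=0.7,
    )
    # Note: the model will not always return valid JSON; add retries/validation.
    return json.loads(response["choices"][0]["message"]["content"])

scraped_passages = ["Example Hero is the main protagonist of ..."]  # your wiki text
examples = [make_example(p) for p in scraped_passages]

with open("wiki_instructions.json", "w") as f:
    json.dump(examples, f, indent=2)
```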

Or, if you are not interested in instruction-finetuning, I guess you could use the pretraining script to further train the model via next-word prediction on your custom dataset.

rasbt avatar Jun 06 '23 16:06 rasbt

I'm very new to this, so I apologize if it's incorrect, but I believe you can just follow the unstructured data guide or adapt the prepare_shakespeare.py code.
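
Something along these lines might work as a starting point. It's just an untested sketch: the tokenizer path and the .bin output format are assumptions on my part, so compare it against what scripts/prepare_shakespeare.py actually does:

```python
import os

import numpy as np
from sentencepiece import SentencePieceProcessor

# Tokenize the concatenated wiki text with the LLaMA tokenizer and save
# train/val splits as uint16 binaries (the 32k LLaMA vocab fits in uint16).
# The tokenizer path, input file, and output layout are assumptions; check
# them against scripts/prepare_shakespeare.py before training on the result.
tokenizer = SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")

with open("wiki_dump.txt", encoding="utf-8") as f:  # all scraped pages, concatenated
    text = f.read()

ids = tokenizer.encode(text)  # list of token ids
split = int(0.9 * len(ids))

os.makedirs("data/wiki", exist_ok=True)
np.array(ids[:split], dtype=np.uint16).tofile("data/wiki/train.bin")
np.array(ids[split:], dtype=np.uint16).tofile("data/wiki/val.bin")
```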

JuicyStandoffishMan avatar Jun 08 '23 03:06 JuicyStandoffishMan