
Best way to fine-tune on wiki data

Open JulianBvW opened this issue 2 years ago • 6 comments

I want to fine-tune LLaMA on data I got from a fandom wiki (for example this page) and was wondering how to design the JSON file with its "prompt", "input", and "output" fields.

I can't just use the prompt "Write the next sentence" and then put two adjacent sentences in the input and output, right?

JulianBvW avatar Jun 05 '23 15:06 JulianBvW

One way would be to use the Dolly 2.0 JSON file as a template, structure your dataset in the same fashion using the same keys, and then run the prepare_dolly script.

[Screenshot: example records from the Dolly 2.0 dataset]
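
For concreteness, here is a rough sketch of how you could assemble such a file from your scraped pages. The key names follow the databricks-dolly-15k format, but the scraping format and the summarization instruction are just made-up placeholders, and you would still need to fill in real responses:

```python
import json

# Hypothetical sketch: turn scraped wiki pages into Dolly-style records.
# Key names follow databricks-dolly-15k; double-check them against the file
# that scripts/prepare_dolly.py downloads. The scraping output format below
# ("title"/"text" dicts) is just an assumption for illustration.
scraped_pages = [
    {"title": "Example Hero", "text": "Example Hero is the main protagonist of ..."},
]

records = []
for page in scraped_pages:
    records.append({
        "instruction": f"Summarize the wiki article about {page['title']}.",
        "context": page["text"],   # the scraped article body
        "response": "",            # you still need a target answer for each record
        "category": "summarization",
    })

with open("my_wiki_dolly.json", "w") as f:
    json.dump(records, f, indent=2)
```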

rasbt avatar Jun 05 '23 15:06 rasbt

Or, maybe even easier, you could structure it like the Alpaca dataset, which uses slightly different key names, and then use the prepare_alpaca script.

[Screenshot: example records from the Alpaca dataset]
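
A made-up entry in that format would look roughly like this (keys as in the original Alpaca JSON; the content is invented):

```python
import json

# Hypothetical Alpaca-style records (keys as in the original alpaca_data.json).
# An empty "input" is fine for instructions that need no extra context.
examples = [
    {
        "instruction": "Describe the character Example Hero from the Example Fandom wiki.",
        "input": "",
        "output": "Example Hero is the main protagonist of ...",
    },
]

with open("my_wiki_alpaca.json", "w") as f:
    json.dump(examples, f, indent=2)
```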

rasbt avatar Jun 05 '23 15:06 rasbt

Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

JulianBvW avatar Jun 05 '23 16:06 JulianBvW

I'm also looking for the best way of creating a dataset. I suppose we have to manually create some initial dataset (instructions/outputs) and can then use self-instruct to expand it and use it for training.

I'm not sure how much data we need to create or how long each instruction and response should be.

Is there a more systematic way of creating manual data?

asadabbas09 avatar Jun 06 '23 03:06 asadabbas09

Yes, but the question is how I can automatically fill instruction, input, and output using the web-scraped texts from the wiki pages?

Oh, I think I now understand what you mean. Essentially, you don't have an instruction-finetuning dataset, correct? In other words, it's an "unlabeled" dataset. One way to handle this would be to create an instruction dataset via imitation learning: use another LLM (e.g., GPT-4 via the API) to generate instruction-response pairs from your scraped text. This is essentially how the Alpaca dataset itself was created (for more details: https://github.com/tatsu-lab/stanford_alpaca#data-generation-process).
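
Very roughly, the generation loop could look something like the sketch below. This is untested, the prompt and model name are only placeholders, and in practice you would want validation and retries because the model will not always return valid JSON:

```python
import json
import openai  # assumes the pre-1.0 openai package that was current in mid-2023

openai.api_key = "sk-..."  # your API key

SYSTEM_PROMPT = (
    "You will be given a passage from a fandom wiki. Respond with a single JSON "
    'object with the keys "instruction", "input", and "output" describing a task '
    "that can be answered from the passage."
)

def make_example(passage: str) -> dict:
    # Ask the teacher model to turn a raw passage into an Alpaca-style record.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": passage},
        ],
        temperature=0.7,
    )
    # Note: the model will not always return valid JSON; add retries/validation.
    return json.loads(response["choices"][0]["message"]["content"])

scraped_passages = ["Example Hero is the main protagonist of ..."]  # your wiki text
examples = [make_example(p) for p in scraped_passages]

with open("wiki_instructions.json", "w") as f:
    json.dump(examples, f, indent=2)
```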

Or, if you are not interested in instruction-finetuning, I guess you could use the pretraining script to further train the model via next-word prediction on your custom dataset.

rasbt avatar Jun 06 '23 16:06 rasbt

I'm very new to this, so I apologize if it's incorrect, but I believe you can just follow the unstructured data guide or adapt the prepare_shakespeare.py code.
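
Something along these lines might work as a starting point. It's just an untested sketch: the tokenizer path and the .bin output format are assumptions on my part, so compare it against what scripts/prepare_shakespeare.py actually does:

```python
import os

import numpy as np
from sentencepiece import SentencePieceProcessor

# Tokenize the concatenated wiki text with the LLaMA tokenizer and save
# train/val splits as uint16 binaries (the 32k LLaMA vocab fits in uint16).
# The tokenizer path, input file, and output layout are assumptions; check
# them against scripts/prepare_shakespeare.py before training on the result.
tokenizer = SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")

with open("wiki_dump.txt", encoding="utf-8") as f:  # all scraped pages, concatenated
    text = f.read()

ids = tokenizer.encode(text)  # list of token ids
split = int(0.9 * len(ids))

os.makedirs("data/wiki", exist_ok=True)
np.array(ids[:split], dtype=np.uint16).tofile("data/wiki/train.bin")
np.array(ids[split:], dtype=np.uint16).tofile("data/wiki/val.bin")
```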

JuicyStandoffishMan avatar Jun 08 '23 03:06 JuicyStandoffishMan