
How to make sentences make more sense?

Open gitihobo opened this issue 2 years ago • 10 comments

Is it the number of iterations? How do I add sense and variety to my LLM?

gitihobo avatar Jun 25 '23 04:06 gitihobo

I have the same confusion. Add data? Add layers? What's the smallest layer count?

JKHenry520 avatar Jun 26 '23 03:06 JKHenry520

Hope someone helps us soon

gitihobo avatar Jun 26 '23 18:06 gitihobo

How much sense do you expect? There are some ideas in the TinyStories paper: https://arxiv.org/abs/2305.07759

their dataset is here: https://huggingface.co/datasets/roneneldan/TinyStories

I have used it to pre-train, and it definitely improved the models (at the expense of much more compute time).

the-crypt-keeper avatar Jun 29 '23 22:06 the-crypt-keeper
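For anyone wondering what "pre-training on TinyStories" means mechanically: nanoGPT's prepare scripts turn raw text into a flat binary file of uint16 token ids (train.bin / val.bin). Here is a minimal sketch of that packing step, with a character-level encoder standing in for the GPT-2 BPE tokenizer (tiktoken) that the real prepare.py uses, so the ids are illustrative only.

```python
from array import array

def encode_char_level(text):
    # Stand-in tokenizer: one id per character (its ordinal value).
    # nanoGPT's actual prepare scripts use tiktoken's GPT-2 BPE instead.
    return [ord(c) for c in text if ord(c) < 65536]

def write_bin(ids, path):
    # nanoGPT stores token ids as a flat file of native uint16 values.
    with open(path, "wb") as f:
        array("H", ids).tofile(f)

story = "Once upon a time, a little robot learned to read."
ids = encode_char_level(story)
write_bin(ids, "train.bin")
```

The training loop then just memory-maps this file and samples random contiguous windows from it, which is why the format is so simple.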

And what are we supposed to do in terms of settings to achieve similar results?

gitihobo avatar Jun 29 '23 22:06 gitihobo

This bug affects the quality negatively: https://github.com/karpathy/nanoGPT/issues/320

Majdoddin avatar Jun 30 '23 10:06 Majdoddin

GPT-2 is glorified autocomplete with the ability to make sentences. If you want better sentences, fine-tune it. I have personally had pretty good success with finetuning gpt-2-medium into making conversation, sentences, and even small paragraphs.

VatsaDev avatar Aug 22 '23 00:08 VatsaDev

So how do you finetune it?

gitihobo avatar Aug 22 '23 06:08 gitihobo

There's a Finetuning section in the README; read that, but the command is: python train.py config/finetune_shakespeare.py

VatsaDev avatar Aug 22 '23 12:08 VatsaDev
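nanoGPT config files like config/finetune_shakespeare.py are plain Python assignments that override the defaults in train.py. A sketch of what a custom finetuning config might look like; the dataset name and hyperparameter values below are illustrative assumptions, not tuned settings.

```python
# Hypothetical finetuning config in nanoGPT's style
# (compare config/finetune_shakespeare.py in the repo).
out_dir = "out-my-finetune"
init_from = "gpt2-medium"        # start from pretrained GPT-2 medium weights
dataset = "my_dataset"           # assumes data/my_dataset/train.bin and val.bin exist
always_save_checkpoint = False   # only keep checkpoints that improve val loss

batch_size = 1
gradient_accumulation_steps = 32
max_iters = 2000                 # finetuning needs far fewer steps than pretraining
learning_rate = 3e-5             # much smaller than the pretraining learning rate
decay_lr = False                 # constant LR is common for short finetunes
```

You would save this as e.g. config/finetune_my_dataset.py and pass it to train.py the same way as the Shakespeare example.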

Thank you, now I know how to fine-tune. What I am not sure about is the data: how do I get the amount of text necessary, and how do I have to format it to make a good fine-tune?

gitihobo avatar Aug 22 '23 20:08 gitihobo

I have addressed many of these issues in my repo NanoChatGPT; all the details are in the README. I formatted my data like this:

<human> ... <endOfText>
<Bot> ... <endOfText>
<human> ... <endOfText>
<Bot> ... <endOfText>
<human> ... <endOfText>
<Bot> ... <endOfText>

Since my data was conversational, I took conversation corpora. The whole list is in my repo README, but one dataset I found to be pretty great was the PersonaChat dataset.

VatsaDev avatar Aug 23 '23 14:08 VatsaDev
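The tagged format above is easy to generate mechanically. A small sketch of a helper that renders (speaker, utterance) pairs into that scheme; the function itself is hypothetical and not part of NanoChatGPT, only the tags come from the format quoted in this thread.

```python
END = "<endOfText>"

def format_dialogue(turns):
    # Render (speaker, utterance) pairs into the <human>/<Bot> tagged
    # format shown above, one turn per line, each ending in <endOfText>.
    lines = []
    for speaker, text in turns:
        tag = "<human>" if speaker == "human" else "<Bot>"
        lines.append(f"{tag} {text.strip()} {END}")
    return "\n".join(lines)

sample = [("human", "Hi there!"), ("bot", "Hello, how can I help?")]
print(format_dialogue(sample))
```

Concatenating many dialogues formatted this way gives you a plain-text corpus you can feed into a nanoGPT-style prepare script to produce train.bin and val.bin.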