TinyLlama
How do you plan on dealing with hallucinations due to knowledge compression?
Hi, I'm very interested in this project, but I would like to know how you plan to deal with the hallucinations that come from having a very high compression ratio, i.e. the ratio of training tokens to model parameters. 3T tokens into 1.1B parameters (roughly 2,700 tokens per parameter) is far more compression than Llama 2 7B's 2T tokens into 7B parameters (roughly 285 tokens per parameter).
Exploring retrieval augmented generation is on our TODO list!
RAG would definitely help, but have you considered training the model on data similar to the SQuAD dataset, so it gets familiar with pulling factual answers from a context and is better suited for RAG?
Yes, we are currently reading papers about retrieval-augmented LMs to find out what training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or something if you have an idea.
RAG involves pulling text from documents or vector embeddings, which is great, but it won't work well with the base text-generation model as it is right now. When you make an official finetune, you would make a TinyLlama-chat
version, and in that you could include training data like squad_v2, so you could train it on chat data like:
question: What is the biggest dinosaur egg ever found?
context: The largest known dinosaur eggs are those of Hypselosaurus priscus ('high ridge lizard'), a 12m (40ft) long titanosaurid which lived about 80 million years ago.
answer: The largest known dinosaur eggs are those of Hypselosaurus priscus
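For instance, here is a minimal sketch of flattening squad_v2 records into that question/context/answer format. It uses the Hugging Face `datasets` schema for squad_v2; the prompt template itself is just an assumption, not anything from the TinyLlama repo.

```python
# Sketch: flatten squad_v2 records into question/context/answer training text.
# Uses the Hugging Face `datasets` library; the prompt template is an assumption.
from datasets import load_dataset

def to_chat_sample(example):
    answers = example["answers"]["text"]
    # squad_v2 also contains unanswerable questions with an empty answer list.
    answer = answers[0] if answers else "I don't know."
    return {
        "text": (
            f"question: {example['question']}\n"
            f"context: {example['context']}\n"
            f"answer: {answer}"
        )
    }

squad = load_dataset("squad_v2", split="train")
chat_data = squad.map(to_chat_sample, remove_columns=squad.column_names)
print(chat_data[0]["text"])
```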
Perhaps something like Toolformer, with special tokens for intermediate tool use and its output, may be feasible.
@walking-octopus Toolformer in the way you suggest might work, but what do you mean by special tokens?
The steps are:
- it gets a natural language instruction
- it makes an API call out of it
- it sends this to the right app
- the app sends a response back
- the model turns this into a natural language output
Unless you want to wrap the API call in special tokens, there probably isn't any use for them.
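For reference, here is a rough sketch of what wrapping the API call in Toolformer-style special tokens could look like: the call and its result are spliced inline between reserved markers, so the model can learn when to emit them. The marker strings and the calculator tool are illustrative assumptions, not anything from TinyLlama or the Toolformer code.

```python
# Sketch of Toolformer-style tool use with reserved marker tokens.
# The markers and the Calculator tool are illustrative assumptions.
TOOL_START, TOOL_END, RESULT_SEP = "<tool>", "</tool>", "->"

def run_tool(call: str) -> str:
    """Dispatch a tool-call string like 'Calculator(400 / 1400)'."""
    if call.startswith("Calculator(") and call.endswith(")"):
        return str(eval(call[len("Calculator("):-1]))  # toy example only
    return "unknown tool"

def expand_tool_calls(text: str) -> str:
    """Replace '<tool>Calculator(...)</tool>' spans with '<tool>call -> result</tool>'."""
    out = []
    rest = text
    while TOOL_START in rest:
        before, rest = rest.split(TOOL_START, 1)
        call, rest = rest.split(TOOL_END, 1)
        out.append(before + f"{TOOL_START}{call} {RESULT_SEP} {run_tool(call)}{TOOL_END}")
    return "".join(out) + rest

print(expand_tool_calls("The ratio is <tool>Calculator(400 / 1400)</tool>, i.e. about 0.29."))
```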
@VatsaDev Can you please give some references for your expectation that more data leads to more hallucination? I understand that there are some heuristics (the Chinchilla paper) about the right amount of data one needs to train an LLM of a specific size, but why are you so sure they are true (i.e. more than just heuristics)?
@artnoage I read a paper on arXiv, but can't find the link unfortunately. Sorry if I come across as certain; I am referring to it in a similar way to the Chinchilla paper. I also phrased the question like that because it was a couple of weeks ago, when I thought saturation seemed more likely than it does now.
> Yes, we are currently reading papers about retrieval-augmented LMs to find out what training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or something if you have an idea.
I think the main thing is instruction tuning first, and maybe add the encoding for multi-turn.
https://github.com/yaodongC/awesome-instruction-dataset @jzhang38 Just in case you did not see this.
@xiaoyunwu, Instruction tuning seems good, but one of the main features of TinyLlama is the context size, which I believe is 2048 tokens. That probably makes the model a good fit for few-shot/multi-shot instead of zero-shot, maybe even 32-shot, like a mini GPT-3. Do you know of any good datasets for this?
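As a rough sketch of what I mean, demonstrations could be packed greedily into the 2048-token window; the tokenizer checkpoint and the Q/A demo format below are illustrative assumptions.

```python
# Sketch: greedily pack few-shot demos into a 2048-token context window.
# The tokenizer checkpoint and the Q/A demo format are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
MAX_TOKENS = 2048

def build_few_shot_prompt(demos, query, reserve=128):
    """Add 'Q: ... A: ...' demos until the budget (minus tokens reserved for the answer) is hit."""
    prompt = ""
    for q, a in demos:
        candidate = prompt + f"Q: {q}\nA: {a}\n\n"
        n_tokens = len(tokenizer(candidate + f"Q: {query}\nA:")["input_ids"])
        if n_tokens > MAX_TOKENS - reserve:
            break
        prompt = candidate
    return prompt + f"Q: {query}\nA:"

demos = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
print(build_few_shot_prompt(demos, "What is the capital of Japan?"))
```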
instruction tuning is not zero-shot (prompt engineering can be).
@xiaoyunwu Looking at the dataset, I see that it's there.
> @VatsaDev Can you please give some references for your expectation that more data leads to more hallucination? I understand that there are some heuristics (the Chinchilla paper) about the right amount of data one needs to train an LLM of a specific size, but why are you so sure they are true (i.e. more than just heuristics)?
I also have the same doubt
@Luoyingfeng8 I already responded to this for @artnoage. I made that claim several months ago; since then, I've seen several instances of more training tokens leading to better models.
You need to release a suffix array of the training corpus to do it properly. This is also useful in designing hypothetical copyright filters.
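For illustration, here is a minimal sketch of what a suffix array over the corpus buys you for exact-substring lookup. This is naive pure Python and only a toy under those assumptions; a real 3T-token corpus would need an external-memory construction.

```python
# Naive suffix-array sketch: exact-substring lookup over a training corpus.
# Pure Python for illustration only; a 3T-token corpus needs an external-memory build.
def build_suffix_array(text: str) -> list[int]:
    """Return suffix start offsets, sorted lexicographically (naive O(n^2 log n) build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, sa: list[int], query: str) -> bool:
    """Binary-search the suffix array for a suffix that starts with `query`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query

corpus = "the largest known dinosaur eggs are those of hypselosaurus priscus"
sa = build_suffix_array(corpus)
print(contains(corpus, sa, "dinosaur eggs"))   # True
print(contains(corpus, sa, "tyrannosaurus"))   # False
```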