
How do you plan on dealing with hallucinations due to knowledge compression?

VatsaDev opened this issue 1 year ago • 16 comments

Hi, I'm very interested in this project, but I would like to know how you plan to deal with the hallucinations that come from having a very high compression ratio of training tokens to model parameters. 3T tokens into 1.1B parameters is far greater compression than Llama 2's 2T tokens into 7B parameters.
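For concreteness, a quick back-of-the-envelope comparison of the two ratios being discussed (using the figures above, nothing more):

```python
# Back-of-the-envelope tokens-per-parameter comparison using the figures above.
tinyllama_ratio = 3e12 / 1.1e9   # ~2727 training tokens per parameter
llama2_7b_ratio = 2e12 / 7e9     # ~286 training tokens per parameter
print(f"TinyLlama: {tinyllama_ratio:.0f} tokens/param, Llama 2 7B: {llama2_7b_ratio:.0f} tokens/param")
```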

VatsaDev avatar Sep 07 '23 12:09 VatsaDev

Exploring retrieval augmented generation is on our TODO list!
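For anyone unfamiliar, a bare-bones illustration of the RAG loop; the library and model names here are assumptions for the sketch, not the project's actual plan:

```python
# Sketch of retrieval-augmented generation: embed documents, retrieve the
# closest ones for a question, and prepend them to the prompt before generation.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder retriever
docs = [
    "Hypselosaurus priscus laid the largest known dinosaur eggs.",
    "TinyLlama is a 1.1B parameter model trained on 3 trillion tokens.",
]
doc_emb = retriever.encode(docs, convert_to_tensor=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    q_emb = retriever.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    return [docs[hit["corpus_id"]] for hit in hits]

question = "What laid the biggest dinosaur eggs?"
prompt = f"context: {' '.join(retrieve(question))}\nquestion: {question}\nanswer:"
print(prompt)  # this prompt would then be fed to the language model
```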

jzhang38 avatar Sep 07 '23 13:09 jzhang38

RAG would definitely help, but have you considered training the model on data similar to the SQUAD dataset, for familiarity with pulling factual answers from a context, so it would be better suited for RAG?

VatsaDev avatar Sep 07 '23 19:09 VatsaDev

Yes, we are currently reading papers on retrieval-augmented LMs to find out which training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or something if you have an idea.

jzhang38 avatar Sep 08 '23 01:09 jzhang38

RAG involves pulling text from documents or vector embeddings, which is great, but it won't work well for the basic text-generation model as it is right now. When you make an official finetune, you would make a TinyLlama-chat version, and in that you could probably include training data like squad_v2, because then you could train it on chat data like:

Question: What is the biggest dinosaur egg ever found?
Context: The largest known dinosaur eggs are those of Hypselosaurus priscus ('high ridge lizard'), a 12 m (40 ft) long titanosaurid which lived about 80 million years ago.
Answer: The largest known dinosaur eggs are those of Hypselosaurus priscus.
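A rough sketch of what that conversion could look like, assuming the Hugging Face `datasets` library; the exact prompt layout is just an illustration, not TinyLlama's actual pipeline:

```python
# Turn squad_v2 rows into the Question/Context/Answer text format above
# for a chat-style finetune.
from datasets import load_dataset

squad = load_dataset("squad_v2", split="train")

def to_chat_example(row):
    # squad_v2 includes unanswerable questions, whose answer list is empty.
    answer = row["answers"]["text"][0] if row["answers"]["text"] else "I don't know."
    return {
        "text": (
            f"Question: {row['question']}\n"
            f"Context: {row['context']}\n"
            f"Answer: {answer}"
        )
    }

chat_data = squad.map(to_chat_example, remove_columns=squad.column_names)
print(chat_data[0]["text"])
```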

VatsaDev avatar Sep 08 '23 14:09 VatsaDev

Perhaps something like Toolformer, with special tokens for intermediate tool use and its output, may be feasible.

walking-octopus avatar Sep 09 '23 13:09 walking-octopus

@walking-octopus Toolformer in the way you suggest might work, but what do you mean by special tokens?

The steps are

  • It gets a natural language instruction
  • It makes an API call out of it
  • It sends this to the right app
  • The app sends a response back
  • The model turns this into a natural language output

Unless you want to wrap the API call in special tokens, there probably isn't any use.
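A minimal sketch of the "wrap the API call in special tokens" idea, roughly in the spirit of Toolformer; the token strings and the calculator tool are illustrative assumptions, not anything TinyLlama actually ships:

```python
# Detect tool calls wrapped in special tokens in the model's output, execute
# them, and splice the result back in so the model can condition on it.
import re

TOOL_START, TOOL_END, RESULT_SEP = "<tool>", "</tool>", "->"

def run_tool(call: str) -> str:
    name, _, arg = call.partition(":")
    if name == "calculator":
        return str(eval(arg, {"__builtins__": {}}))  # toy example only
    return "unknown tool"

def expand_tool_calls(generated: str) -> str:
    # Replace "<tool>calculator:12*7</tool>" with "<tool>calculator:12*7 -> 84</tool>"
    def fill(match):
        call = match.group(1)
        return f"{TOOL_START}{call} {RESULT_SEP} {run_tool(call)}{TOOL_END}"
    return re.sub(rf"{TOOL_START}(.*?){TOOL_END}", fill, generated)

print(expand_tool_calls("The answer is <tool>calculator:12*7</tool>."))
```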

VatsaDev avatar Sep 09 '23 13:09 VatsaDev

@VatsaDev Can you please give some references for your expectation that more data leads to more hallucination? I understand that there are some heuristics (the Chinchilla paper) about the right amount of data one needs to train an LLM of a specific size, but why are you so sure that they are true (i.e., more than just heuristics)?

artnoage avatar Sep 22 '23 10:09 artnoage

@artnoage I read a paper on arXiv, but unfortunately I can't find the link. Sorry if I come across as certain; I'm referring to it in a similar way to the Chinchilla paper. I also phrased the question like that a couple of weeks ago, when saturation seemed more likely to me than it does now.

VatsaDev avatar Sep 22 '23 21:09 VatsaDev

Yes, we are currently reading papers on retrieval-augmented LMs to find out which training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or something if you have an idea.

I think the main thing is instruction tuning first, and maybe adding the encoding for multi-turn.
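One possible multi-turn encoding, purely as an illustration of what "encoding for multi-turn" could mean here (the role tags and end-of-turn marker are assumptions, not a format the project has committed to):

```python
# Serialize a multi-turn conversation into a single training string using
# role tags and an end-of-turn marker.
def encode_multi_turn(turns):
    """turns: list of (role, text) pairs, e.g. [("user", ...), ("assistant", ...)]"""
    return "".join(f"<|{role}|>\n{text}</s>\n" for role, text in turns)

print(encode_multi_turn([
    ("user", "What is the biggest dinosaur egg ever found?"),
    ("assistant", "The largest known dinosaur eggs are those of Hypselosaurus priscus."),
]))
```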

xiaoyunwu avatar Oct 02 '23 17:10 xiaoyunwu

https://github.com/yaodongC/awesome-instruction-dataset @jzhang38 Just in case you did not see this.

xiaoyunwu avatar Oct 02 '23 17:10 xiaoyunwu

@xiaoyunwu Instruction tuning seems good, but one of the main features of TinyLlama is its context size, which I believe is 2048 tokens. That probably makes the model a good fit for few-shot/multi-shot rather than zero-shot, maybe even 32-shot, like a mini GPT-3. Do you know of any good datasets for this?
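A hedged sketch of packing few-shot demonstrations into a 2048-token window; the checkpoint name is a placeholder (swap in whichever TinyLlama checkpoint you are using), and the Q/A prompt format is just for illustration:

```python
# Greedily add demonstrations until the 2048-token context budget is nearly full.
from transformers import AutoTokenizer

MAX_CONTEXT = 2048
RESERVED_FOR_ANSWER = 128

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder

def pack_few_shot(examples, query):
    prompt = ""
    budget = MAX_CONTEXT - RESERVED_FOR_ANSWER
    for ex in examples:
        shot = f"Q: {ex['question']}\nA: {ex['answer']}\n\n"
        candidate = prompt + shot
        if len(tokenizer(candidate + f"Q: {query}\nA:")["input_ids"]) > budget:
            break
        prompt = candidate
    return prompt + f"Q: {query}\nA:"

shots = [{"question": "2+2?", "answer": "4"}, {"question": "Capital of France?", "answer": "Paris"}]
print(pack_few_shot(shots, "Capital of Japan?"))
```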

VatsaDev avatar Oct 02 '23 22:10 VatsaDev

Instruction tuning is not zero-shot (prompt engineering can be).

xiaoyunwu avatar Oct 02 '23 22:10 xiaoyunwu

@xiaoyunwu Looking at the dataset, I see that it's there.

VatsaDev avatar Oct 03 '23 19:10 VatsaDev

@VatsaDev Can you please give some references for your expectation that more data leads to more hallucination? I understand that there are some heuristics (the Chinchilla paper) about the right amount of data one needs to train an LLM of a specific size, but why are you so sure that they are true (i.e., more than just heuristics)?

I also have the same doubt

Luoyingfeng8 avatar Oct 24 '23 03:10 Luoyingfeng8

@Luoyingfeng8 I already responded to this for artnoage. I made this claim several months ago; since then, I've seen several instances where training on more tokens produced better models.

VatsaDev avatar Oct 24 '23 17:10 VatsaDev

You would need to release a suffix array of the training corpus to check this properly. It is also useful for designing hypothetical copyright filters.
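A minimal sketch of the suffix-array idea: build a suffix array over (a slice of) the training corpus, then binary-search it to check whether a generated span appears verbatim. A real 3T-token corpus needs a disk-backed construction; this in-memory toy is only meant to show the lookup.

```python
# Toy suffix array: O(n^2 log n) construction plus a binary search for
# verbatim membership of a query string in the indexed text.
def build_suffix_array(text: str) -> list[int]:
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, sa: list[int], query: str) -> bool:
    # Lower-bound binary search over suffixes, then compare the prefix.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:].startswith(query)

corpus = "the largest known dinosaur eggs are those of hypselosaurus priscus"
sa = build_suffix_array(corpus)
print(contains(corpus, sa, "dinosaur eggs"))   # True
print(contains(corpus, sa, "tyrannosaurus"))   # False
```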

chadbrewbaker avatar Nov 20 '23 07:11 chadbrewbaker