
How to perform finetuning on very long inputs

Open SilenceGTX opened this issue 1 year ago • 7 comments

This might be a silly question, but I would really like to learn how to perform finetuning on very long inputs. For example, say I have a bunch of documentation, where each document might be hundreds of tokens long. If I want to build a QA system based on the documentation, how should I prepare the dataset? I assume I cannot simply concatenate all the documentation together, right (as it will be way longer than cutoff_len)? If I feed the documents one by one, how should I deal with the situation where someone asks about document A, but the input document is B? Is there a proper way or best practice to deal with this? Thanks.

SilenceGTX avatar Mar 24 '23 23:03 SilenceGTX

Definitely not a silly question; there are multiple projects dedicated to doing similar things. You can't really get that result by fine-tuning. You just need to leverage the model's ability to understand and process information.

You can start by using LlamaIndex, which queries the LLM via LangChain, which can support Llama either through Basaran or by writing your own API adapter. I made it work with the text-generation-webui API.
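Out of the box, the flow is roughly this (a minimal sketch; exact imports vary between LlamaIndex versions, and pointing it at a local Llama means swapping in a custom LLM wrapper instead of the default backend):

```python
# Build a vector index over the documentation and query it.
# Module paths differ across llama_index versions; this follows the newer layout.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("docs/").load_data()  # one file per piece of documentation
index = VectorStoreIndex.from_documents(documents)       # embeds and stores the chunks

query_engine = index.as_query_engine()
response = query_engine.query("How do I configure feature X?")  # hypothetical question
print(response)
```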

It's not as hard as it sounds to set up... but the results were not great, IMO. I played with it for a bit, but there's certainly more that could be done.

kooshi avatar Mar 24 '23 23:03 kooshi

Thanks @kooshi. I sort of get the idea from LangChain: we can get the embedding of each document and compare it with the embedding of the asked question, so that we can retrieve the most related ones. Then use those as the input, with the question as the instruction, to get the response. For finetuning itself, I assume we don't have to worry about the irrelevant documents and can just use the related ones to set up the data.
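Roughly what I have in mind (a sketch; the `embed()` function is just a placeholder for whatever model ends up producing the embeddings):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: some embedding model would go here.
    raise NotImplementedError

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q_vec = embed(question)
    scored = sorted(docs, key=lambda d: cosine_similarity(q_vec, embed(d)), reverse=True)
    return scored[:top_k]
```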

Then the only question would be how to get the embeddings? (Maybe we need another small model?)

SilenceGTX avatar Mar 25 '23 11:03 SilenceGTX

I am unsure why fine-tuning wouldn't work. It will be more expensive for sure, since in-context learning (what LlamaIndex does) does not involve training, but fine-tuning on your own corpus will certainly work; it's called domain adaptation. Are you saying that it won't work in this mode because the LoRA adapters are perhaps too small in terms of parameters?

Edit: Check this issue, https://github.com/tloen/alpaca-lora/issues/45; they are discussing the same thing.
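To be concrete, domain adaptation here would just mean continuing causal-LM training on the raw documentation with a LoRA adapter attached, roughly like this (a minimal sketch using peft/transformers; the model name, rank, and other hyperparameters are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "decapoda-research/llama-7b-hf"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach a LoRA adapter; r controls how many trainable parameters we add.
lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here the usual Trainer loop runs over the tokenized documentation,
# with each document chunked to fit inside cutoff_len.
```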

gianfra-t avatar Mar 25 '23 16:03 gianfra-t

True. It is an oversimplification to say you "can't" get those results by fine-tuning. Rather, it could be difficult, especially on consumer hardware, because I don't think an LLM can learn a lot of new facts through a LoRA, and you need to avoid catastrophic forgetting. Some combination of fine-tuning and indexing would probably yield the best results.

I'm certainly no expert, and I'm really excited to see how this exact need is met in the next few months, because it will be one of the most useful ways we can use these models.

kooshi avatar Mar 25 '23 18:03 kooshi

I see. But isn't there a way to know the rank, or essentially the number of LoRA parameters, required to learn x amount of new tokens? Or at least to try with an increasing number of parameters and validate to avoid forgetting, something like the sketch below. I'm genuinely interested in this and in seeing how it develops.
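A naive version of that sweep could look something like this (just a sketch; `finetune_lora`, `eval_domain`, and `eval_general` are hypothetical helpers standing in for the actual training and evaluation code):

```python
# Try increasing LoRA ranks and track both domain performance and a general
# benchmark, so a drop in the latter flags catastrophic forgetting.
ranks = [4, 8, 16, 32, 64]
results = {}
for r in ranks:
    adapter = finetune_lora(rank=r)            # hypothetical: train a LoRA of rank r
    results[r] = {
        "domain": eval_domain(adapter),        # hypothetical: held-out questions on the docs
        "general": eval_general(adapter),      # hypothetical: general-purpose benchmark
    }

for r, scores in results.items():
    print(r, scores)
```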

gianfra-t avatar Mar 25 '23 23:03 gianfra-t

Thanks guys @gianfra-t @kooshi. I guess my question was more about how to design the dataset for fine-tuning on very long inputs, but I assume that by leveraging tools like LangChain, it is no longer a problem.

SilenceGTX avatar Mar 27 '23 10:03 SilenceGTX