Generate settings and MoE Loss
This PR addresses the following:
New `max_time` setting for generation, allowing to specify a maximum time in seconds per generation. Closes https://github.com/h2oai/h2o-llmstudio/issues/568
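A minimal sketch of how such a setting maps onto the Hugging Face `generate()` call; the model name and concrete values are placeholders, not the defaults wired up in LLM Studio:

```python
# Sketch: `max_time` caps generation at the given number of seconds,
# even if `max_new_tokens` has not been reached yet.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2o-danube-1.8b-chat"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Summarize the following text:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    max_time=5.0,  # stop generating after ~5 seconds
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```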
New `prompt_lookup_num_tokens` setting, as discussed in https://twitter.com/joao_gante/status/1747322413006643259
It will likely only help for summarization and QA tasks; default chat inference even got slower when using it.
But let's keep it as a setting one can try.
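Continuing the sketch above, prompt lookup decoding is switched on by passing `prompt_lookup_num_tokens` to `generate()`; the value here is only an example:

```python
# Prompt lookup decoding: candidate tokens are drawn from n-grams in the prompt
# itself, so it mainly pays off when the output copies spans of the input
# (summarization, extractive QA) and can slow down free-form chat.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    prompt_lookup_num_tokens=10,  # example value for the lookup window
)
```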
Adds a new loss function, MoECrossEntropy, that can be used for MoE models like Mixtral. It follows the auxiliary load-balancing loss of https://arxiv.org/pdf/2101.03961.pdf as implemented in https://github.com/huggingface/transformers/blob/v4.37.2/src/transformers/models/mixtral/modeling_mixtral.py#L77
First experiments with Mixtral and LoRA did not show a big impact. The overall scale of the loss is very similar to regular cross entropy, so the default additive term might be too low, but the recommended settings from the paper and HF are kept as defaults for now.
Needs more experimentation to better understand the impact. Closes https://github.com/h2oai/h2o-llmstudio/issues/607
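For reference, a rough sketch of such a loss: standard cross entropy plus the Switch Transformers auxiliary load-balancing term, computed from the router logits a MoE model like Mixtral returns with `output_router_logits=True`. Function names, shapes, and the coefficient are illustrative and not necessarily identical to the implementation in this PR:

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, num_experts: int, top_k: int = 2) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts), concatenated over all MoE layers
    routing_weights = F.softmax(router_logits, dim=-1)
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    expert_mask = F.one_hot(selected_experts, num_experts).float()
    # f_e: fraction of routing slots dispatched to each expert
    tokens_per_expert = expert_mask.mean(dim=(0, 1))
    # P_e: mean router probability assigned to each expert
    router_prob_per_expert = routing_weights.mean(dim=0)
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)


def moe_cross_entropy(logits, labels, router_logits, num_experts, top_k=2, aux_coef=0.01):
    # aux_coef is the additive scaling term discussed above (value here is illustrative)
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    aux = load_balancing_loss(torch.cat(router_logits, dim=0), num_experts, top_k)
    return ce + aux_coef * aux
```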
Maybe hold off on the review a bit; I am exploring the loss a bit more right now. With LoRA, it will probably not even properly train the gate (which can be a good thing).
Closing this for now.