text-generation-inference
How do I use `trust_remote_code=True` for Mosaic models?
Feature request
import torch
from transformers import AutoConfig, AutoModelForCausalLM
pretrained_model_dir = 'mosaicml/mpt-7b'
config = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float16)
https://discuss.huggingface.co/t/how-to-use-trust-remote-code-true-with-load-checkpoint-and-dispatch/39849/1
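For reference, the pattern from that thread looks roughly like the sketch below; the local checkpoint path is hypothetical and accelerate is assumed to be installed:

```python
# Rough sketch of the init_empty_weights / load_checkpoint_and_dispatch pattern
# from the linked thread; the checkpoint path is hypothetical.
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint_dir = "/path/to/mpt-7b"  # hypothetical local copy of mosaicml/mpt-7b
config = AutoConfig.from_pretrained(checkpoint_dir, trust_remote_code=True)

# Build the model skeleton without allocating weights, then spread the real
# weights across the available devices.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(
        config, trust_remote_code=True, torch_dtype=torch.float16
    )
model = load_checkpoint_and_dispatch(model, checkpoint_dir, device_map="auto")
```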
Motivation
Support for passing model-specific params such as `trust_remote_code=True` when loading models.
Your contribution
Sure.
#363 is a first step to support mosaicml/mpt-7b. However, it seems that the past_key_values layout used by this model is not compatible with this repository. I will try some workarounds in the coming days.
Hi! I am able to get the server to run properly with an MPT model (converted to HF format using the scripts in their llm-foundry repo). I can run generation fine using normal Python and HF generate(), but the generation server (using the same generation parameters) sadly doesn't work; it just returns an empty string. Wondering if you ran into the same problem? Thank you!
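(For reference, the working plain-transformers path described here looks roughly like the sketch below; the checkpoint path and generation settings are illustrative, not taken from the thread.)

```python
# Illustrative plain-transformers generation with a converted MPT checkpoint.
# The path below is a hypothetical output of llm-foundry's HF conversion script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./mpt-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```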
Okay, edit: it seems like the issue is that decoding is stopping prematurely due to the EOS token. Is there a way to make it just generate up to the sequence length? Looking through the code, it looks like there is something in Rust-land to turn this on (apparently for testing), but I can't pass anything through the REST endpoint to trigger this behavior.
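(For anyone following along, the REST call in question looks roughly like this; host, port, and parameter values are placeholders. As noted above, nothing in the request schema keeps generation going past EOS.)

```python
# Illustrative request against TGI's /generate endpoint (host/port assumed).
# max_new_tokens only caps the length; nothing in "parameters" disables the
# EOS-based stopping described above.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Hello, my name is",
        "parameters": {"max_new_tokens": 64, "do_sample": True, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```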
@metemadi did you ever get any further with this?
I managed to stand up MPT-7B in the container, but I was also only getting a single returned token.
@OlivierDehaene Do you know exactly what's causing the issues with the MPT model? I'm looking at making a fix.
Hi @harryjulian and others! Unfortunately, no. I can run the code (I take an MPT model I trained, use the helper script in the llm-foundry repo to convert it to an HF checkpoint, and then import it like any other HF model), but it just generates <|endoftext|> the whole time, which stops generation. I even tricked the model (by modifying the configs) into not treating <|endoftext|> as a stop token.. but guess what happened? It just generated a bunch of <|endoftext|> tokens (up to the number of new tokens requested). So I threw in the towel and just used the generate() function for my application : ) I even tried another library (gpt4all) to get a chat interface, which also has an HF import option, and that gave me a different error. A huge thank you to everyone for looking at this. The llm-foundry tools are dead simple to use for training (and very fast); I would love to be able to get custom MPT checkpoints working with your amazing high-performance streaming inference library!
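(A config-free way to run the same experiment, in case it helps anyone reproducing this: the sketch below blocks EOS at generation time instead of editing the checkpoint's config. The checkpoint path is hypothetical, min_new_tokens needs a reasonably recent transformers release, and, as described above, this only avoids the early stop; it doesn't change what the model wants to emit.)

```python
# Block <|endoftext|> at generation time instead of editing the model config:
# setting min_new_tokens == max_new_tokens masks the EOS logit until the length
# cap is reached, so generation always runs to max_new_tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./mpt-7b-hf"  # hypothetical converted MPT checkpoint
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

inputs = tok("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=128, min_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=False))
```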
PR https://github.com/huggingface/text-generation-inference/pull/514 should help run MPT models on TGI.
It doesn't use flash attention (yet) because that requires forking and extending the flash attention kernel.