
How do I use `trust_remote_code=True` for MosaicML models?

cfregly opened this issue 2 years ago · 6 comments

Feature request

pretrained_model_dir = 'mosaicml/mpt-7b'

config = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float16)

https://discuss.huggingface.co/t/how-to-use-trust-remote-code-true-with-load-checkpoint-and-dispatch/39849/1
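A fuller, self-contained version of the snippet above, following the pattern from the linked forum thread. Note that `from_pretrained` is used here to actually load the weights (rather than `from_config`, which only builds the architecture); the dtype, device placement, and prompt are illustrative assumptions, not part of the original report:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

pretrained_model_dir = "mosaicml/mpt-7b"

# MPT ships its own modeling code, so trust_remote_code=True is needed everywhere
config = AutoConfig.from_pretrained(pretrained_model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_dir,
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda").eval()

inputs = tokenizer("MosaicML is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```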

Motivation

MPT ships custom modeling code with model-specific params, so loading it requires `trust_remote_code=True`.

Your contribution

Sure.

cfregly avatar May 23 '23 16:05 cfregly

#363 is a first step to support mosaicml/mpt-7b. However, it seems that the past_key_values layout used by this model is not compatible with this repository. I will try some workarounds in the coming days.

OlivierDehaene avatar May 23 '23 18:05 OlivierDehaene

Hi! I am able to get the server running properly with an MPT model (converted to HF format using the scripts in their llm-foundry repo). I can run generation fine using normal Python and HF generate(), but the generation server (using the same generate parameters) sadly doesn't work; it just returns an empty string. Wondering if you ran into the same problem? Thank you!!

Edit: it seems the issue is that decoding is stopping prematurely due to the EOS token. Is there a way to make it just generate up to the sequence length? Looking through the code, there appears to be something in Rust-land to turn this on (apparently for testing), but I can't pass anything through the REST endpoint to trigger this behavior.

metemadi avatar May 26 '23 13:05 metemadi
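For context, generation parameters are passed to TGI's REST API inside a `parameters` object on the `/generate` route; a minimal sketch (server URL and values are illustrative), which is limited to the documented parameters such as `max_new_tokens`, `stop`, and sampling settings:

```python
import requests

# Assumes a TGI server is listening on localhost:8080
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "MosaicML is",
        "parameters": {
            "max_new_tokens": 64,  # upper bound; decoding can still stop earlier on EOS
            "do_sample": True,
            "temperature": 0.7,
        },
    },
)
print(resp.json()["generated_text"])
```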

@metemadi did you ever get any further with this?

I managed to stand up MPT-7B in the container, but I was also only getting a single returned token.

harryjulian avatar Jun 19 '23 08:06 harryjulian

@OlivierDehaene Do you know exactly what's causing the issues with the MPT model? I'm looking at making a fix.

harryjulian avatar Jun 19 '23 10:06 harryjulian

Hi @harryjulian and others! So unfortunately no. I can run the code (I take an MPT model I trained, use the helper script in the llm-foundry repo to convert it to an HF checkpoint, then import it like any other HF model), but it just generates <|endoftext|> the whole time (thereby stopping generation). I even tricked the model (by modifying the configs) into not treating <|endoftext|> as a stop token, but guess what happened? It just generated a BUNCH of <|endoftext|> (up to the number of new tokens requested). So I threw in the towel and just used the generate() function for my application :) I even tried another library (gpt4all) to get a chat interface, which also has an HF import option, and that gave me a different error. A huge thank you to everyone for looking at this - the llm-foundry tools are dead simple to use for training (and very fast). Would love to be able to get custom MPT checkpoints working with your amazing high-performance streaming inference library!

metemadi avatar Jun 19 '23 12:06 metemadi
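The config experiment described above presumably looked something like the sketch below (the checkpoint path is hypothetical and the exact field overridden is an assumption):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/converted-mpt-checkpoint"  # hypothetical local HF checkpoint

config = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
config.eos_token_id = None  # stop generate() from treating <|endoftext|> as a stop token

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, config=config, trust_remote_code=True)

# With EOS stopping disabled, generate() runs up to max_new_tokens; in the report
# above the model then emitted <|endoftext|> repeatedly.
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```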

PR https://github.com/huggingface/text-generation-inference/pull/514 should help run MPT models on TGI.

It doesn't use flash (yet) because that requires forking and extending the flash attention kernel.

Narsil avatar Jul 01 '23 10:07 Narsil
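Once that PR lands, querying a TGI server that is serving an MPT checkpoint should look the same as for any other model. A minimal sketch with the official Python client (endpoint URL assumed; install with `pip install text-generation`):

```python
from text_generation import Client

# Assumes a running TGI instance serving an MPT model on localhost:8080
client = Client("http://localhost:8080")
response = client.generate("MosaicML is", max_new_tokens=32)
print(response.generated_text)
```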