
When to stop in the LLMEval?

MatthewWaller opened this issue 4 months ago • 7 comments

In the LLMEval project, the generation stops after reaching a limit on tokens. Is there a way to configure stopping when it finds a special token? I tried to look for the Phi 3's end token but it seems to go off the rails earlier than when <|end|> or <|endoftext|> appear. Thoughts?

MatthewWaller avatar Apr 24 '24 22:04 MatthewWaller

It should stop at the end-of-sequence (EOS) token ID: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199

The fact that it's not stopping likely means it doesn't have the right EOS token ID set. Which model did you try?

awni avatar Apr 25 '24 02:04 awni

@awni I was working with phi34bit (the 4-bit Phi-3 model)

MatthewWaller avatar Apr 25 '24 02:04 MatthewWaller

Looks like this is the eos token for that model: https://huggingface.co/mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed/blob/main/tokenizer_config.json#L340. We'll need to check to make sure the IDs match / the tokenizer is reading it correctly.

awni avatar Apr 25 '24 03:04 awni

Specifically, the code looks for either the unknown token or the EOS token:

        if t == tokenizer.unknownTokenId || t == tokenizer.eosTokenId {

https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199

The didGenerate block that is passed in can also return .stop if you are implementing this yourself.

davidkoski avatar Apr 25 '24 16:04 davidkoski
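That stop check can be sketched in isolation (a minimal sketch — `Disposition` and `disposition(tokens:maxTokens:stopTokens:)` are illustrative names, not the library's API; token IDs 0 and 32000 are the unknown and EOS IDs discussed in this thread):

```swift
// Sketch only: mirrors the .stop / .more disposition that a
// didGenerate-style callback returns. Not the MLXLLM API itself.
enum Disposition { case stop, more }

// Decide whether to keep generating, given the tokens produced so far.
// stopTokens would hold e.g. the unknown (0) and EOS (32000) IDs.
func disposition(tokens: [Int], maxTokens: Int, stopTokens: Set<Int>) -> Disposition {
    // Stop as soon as the most recent token is a stop token.
    if let last = tokens.last, stopTokens.contains(last) {
        return .stop
    }
    // Otherwise stop only when the token budget is exhausted.
    return tokens.count >= maxTokens ? .stop : .more
}
```

Checking only the last token keeps the callback O(1) per step instead of rescanning the whole array each time.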

Alright, well unknownTokenId is 0 and eosTokenId is 32000, which I believe is correct, and it matches "eos_token": "<|endoftext|>", from HuggingFace. I can see in the debugger that the eosToken is <|endoftext|>. The model just never seems to produce that token. Hmmm. For instance, I can tell phi3 to "Write 3 words" and on HuggingFace chat, it appropriately stops. So I'm guessing it's producing that token for them. It just never shows up in the output I'm getting.

MatthewWaller avatar Apr 25 '24 21:04 MatthewWaller

It may be related to this: https://github.com/huggingface/swift-transformers/issues/92 -- we are not passing in a proper prompt and the generation may be impacted.

That issue is a bit terse but basically the extra tokens are not being honored when tokenizing.

davidkoski avatar Apr 25 '24 21:04 davidkoski

Oh dang, yeah, I see that now, I pass in "<|user|>\nWrite 2 words<|end|>\n<|assistant|>\n" after preparePrompt, and that should be 9 tokens or so. But it's encoded as 24 tokens!

MatthewWaller avatar Apr 25 '24 22:04 MatthewWaller

Saw that https://github.com/huggingface/swift-transformers/issues/92 has been closed and special tokens should now be accounted for. I'm still running into issues with the model returning the '<|end|>' token when the assistant is done — has anyone found a more manual solution for getting the correct Phi-3 response?

tylerckeller avatar Apr 30 '24 15:04 tylerckeller

I made a little project where I directly looked for that token (32001) and returned .stop if I found it, in the LLMEvaluator. Once I did that, and got the correct tokens in preparePrompt, everything worked correctly.

MatthewWaller avatar Apr 30 '24 15:04 MatthewWaller

Gotcha, so something similar to:

let result = await MLXLLM.generate(
    promptTokens: promptTokens, parameters: generateParameters, model: model,
    tokenizer: tokenizer
) { tokens in
    // update the output -- this will make the view show the text as it generates
    let endGen = tokens.contains(32001)
    if tokens.count % displayEveryNTokens == 0 {
        let text = tokenizer.decode(tokens: tokens)
        await MainActor.run {
            self.output = text
        }
    }

    if tokens.count >= maxTokens || endGen {
        return .stop
    } else {
        return .more
    }
}

tylerckeller avatar Apr 30 '24 16:04 tylerckeller

Exactly, and heads up that there is a little bug you may run into at the end, below that bit. I had to change it to

// update the text if needed, e.g. we haven't displayed because of displayEveryNTokens
// keep only the tokens before the <|end|> token (32001)
var validTokens = Array(result.tokens.prefix(while: { $0 != 32001 }))
if !validTokens.isEmpty {
    validTokens.removeLast()
}
let text = tokenizer.decode(tokens: validTokens)
await MainActor.run {
    if result.output != self.output {
        self.output = text
    }
    running = false
    self.stat = " Tokens/second: \(String(format: "%.3f", result.tokensPerSecond))"
}

Because the <|end|> token (and sometimes more) can still show up in that final bit of output.

MatthewWaller avatar Apr 30 '24 16:04 MatthewWaller
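The trimming step above can be pulled out into a small self-contained helper (a hedged sketch — `trimmedBeforeStop` is an illustrative name, and 32001 is Phi-3's <|end|> token ID from this thread):

```swift
// Sketch: return the tokens before the first stop token (e.g. 32001,
// Phi-3's <|end|>). prefix(while:) stops at the first match, so the
// stop token itself and anything after it are excluded, and an empty
// or stop-free token list is handled safely.
func trimmedBeforeStop(_ tokens: [Int], stopToken: Int = 32001) -> [Int] {
    Array(tokens.prefix(while: { $0 != stopToken }))
}
```

Decoding `trimmedBeforeStop(result.tokens)` instead of `result.tokens` keeps the trailing special tokens out of the displayed text.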

Closing now that the main issue has been resolved with transformers.

MatthewWaller avatar Apr 30 '24 16:04 MatthewWaller