mlx-swift-examples icon indicating copy to clipboard operation
mlx-swift-examples copied to clipboard

LLMEval crashing due to force unwrapping of optional in token decode for Gemma

Open ViRo3 opened this issue 4 months ago • 6 comments

st.txt is the crash log.

Model : Gemma 2B Quantized Prompt : "Write code to boot a raspberry pico"

ViRo3 avatar Apr 09 '24 17:04 ViRo3

This looks a bit like:

  • https://github.com/huggingface/swift-transformers/pull/52
  • https://github.com/huggingface/swift-transformers/issues/51

What version (hash) of swift-transformers are you using?

davidkoski avatar Apr 09 '24 19:04 davidkoski

74b9421

ViRo3 avatar Apr 10 '24 03:04 ViRo3

OK, that is the current head of main and includes that fix. Oh, interesting -- that particular prompt does reproduce it for me!

Here is the token in question:

(lldb) p model.convertIdToToken(235345)
(String?) nil

thought it looks like it is defined in the tokenizer.json:

      "#": 235345,

The problem is that there are two (ish) #'s:

egrep '122661|235345' tokenizer.json
      "#": 122661,
      "#": 235345,

though one of them has some extra unicode, in particular a ZERO WIDTH NO-BREAK SPACE:

egrep '122661|235345' tokenizer.json | od -c
0000000                            " 357 273 277   #   "   :       1   2
0000020    2   6   6   1   ,  \n                           "   #   "   :
0000040        2   3   5   3   4   5   ,  \n                            
0000051

It looks like the strings map to the same value on read and the tokenizer model loses the entry for 235345.

davidkoski avatar Apr 10 '24 05:04 davidkoski

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

Blaizzy avatar Apr 10 '24 07:04 Blaizzy

@ViRo3 could install the latest swift-transformers from GH and let us know if the issue persists.

Btw this might be a huggingface swift-transformers issue and not MLX.

I have checked and it persists and is a swift-transformer issue so filed an issue there too.

ViRo3 avatar Apr 10 '24 13:04 ViRo3

I had a thought about how this might be fixed -- we might be able to make something along these lines:

struct CodePointString : StringProtocol {
    let value: String

    static func ==(lhs: CodePointString, rhs: CodePointString) {
        // this isn't actually comparable but this is the idea
        lhs.utf16 == rhs.utf16
    }

    func hash(hasher: inout Hasher) {
        hasher.combine(value.utf16)
    }

    ...
}

Then we load the config as [CodePointString:Any]. Something like this would let us treat the strings as code points (which is what python, etc. do) rather than unicode strings. Maybe :-)

davidkoski avatar Apr 11 '24 05:04 davidkoski