llama.cpp
BERT wordpiece tokenizer differs from official HF implementation
Our wordpiece tokenizer has issues with unicode. One of the problems is incomplete NFD normalization, causing many characters with accents to be dropped entirely when tokenized. Examples include Kantō -> Kant and lǜshi -> lshi.
Here is the diff for nomic-embed-text-v1 on wikitext.test.raw:
Diff
--- good_tokens.txt 2024-02-14 14:57:16.501519622 -0500
+++ lcpp_tokens.txt 2024-02-14 14:57:16.832522224 -0500
@@ -3665,8 +3665,8 @@
1006: (
100: [UNK]
1790: 史
-11895: shi
-11895: shi
+14021: sh
+14021: sh
1007: )
1012: .
1996: the
@@ -3848,7 +3848,7 @@
1006: (
100: [UNK]
100: [UNK]
-11895: shi
+14021: sh
25981: sheng
1007: )
1010: ,
@@ -4490,8 +4490,8 @@
2124: known
2005: for
2010: his
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
1010: ,
1037: a
2828: type
@@ -4538,8 +4538,8 @@
1012: .
2010: his
2190: best
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
2224: use
1996: the
5903: parallel
@@ -5255,8 +5255,8 @@
1999: in
17903: transforming
1996: the
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
2013: from
8210: mere
2773: word
@@ -5400,7 +5400,6 @@
22281: mats
19098: ##uo
24234: bash
-2080: ##o
1010: ,
1996: the
2200: very
@@ -5483,10 +5482,9 @@
2004: as
25277: bunk
2050: ##a
-18454: shu
+14021: sh
2890: ##re
4509: ##ish
-2226: ##u
1999: in
1996: the
6280: 9th
@@ -5570,11 +5568,11 @@
18952: sai
6806: ##ho
22332: ##kus
-6979: ##hu
+2232: ##h
1012: .
2010: his
3076: student
-14684: chu
+10381: ch
5289: ##gan
25540: eng
8454: ##ets
@@ -5598,14 +5596,14 @@
18443: preface
2015: ##s
1012: .
-14684: chu
+10381: ch
5289: ##gan
1005: '
1055: s
3076: student
21025: gi
-3527: ##do
-18454: shu
+2094: ##d
+14021: sh
17426: ##shin
2018: had
2485: close
@@ -5637,7 +5635,7 @@
2028: one
2154: day
9152: ni
-5558: ##jo
+3501: ##j
10930: yo
6182: ##shi
15319: ##moto
@@ -5661,7 +5659,7 @@
1010: ,
2356: asked
21025: gi
-3527: ##do
+2094: ##d
1010: ,
1000: "
2323: should
@@ -5678,7 +5676,7 @@
1029: ?
1000: "
21025: gi
-3527: ##do
+2094: ##d
15048: dared
2000: to
7514: reply
@@ -5772,7 +5770,6 @@
2386: ##man
1010: ,
24234: bash
-2080: ##o
1010: ,
1998: and
18454: shu
@@ -5817,8 +5814,8 @@
11865: fu
1005: '
1055: s
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
1006: (
100: [UNK]
100: [UNK]
@@ -5847,7 +5844,7 @@
14483: ##cian
5784: scholars
1998: and
-16480: cho
+10381: ch
11483: ##nin
1006: (
27938: townspeople
@@ -5892,12 +5889,12 @@
4261: 37
1997: of
11721: ga
-6806: ##ho
+2232: ##h
21122: bun
-14235: ##shu
+4095: ##sh
2008: that
1062: z
-14428: ##ime
+2213: ##m
2072: ##i
1031: [
4241: du
@@ -5945,7 +5942,6 @@
22281: mats
19098: ##uo
24234: bash
-2080: ##o
1010: ,
1996: the
4602: greatest
@@ -6151,8 +6147,8 @@
7893: verse
1010: ,
2030: or
-16299: lush
-2072: ##i
+1048: l
+6182: ##shi
1007: )
1010: ,
1998: and
@@ -9195,7 +9191,7 @@
1996: the
11003: preceding
11865: fu
-6499: ##so
+2015: ##s
2465: class
1010: ,
2027: they
@@ -9216,7 +9212,6 @@
1996: the
2307: great
26044: kant
-2080: ##o
8372: earthquake
1999: in
4927: 1923
@@ -9490,7 +9485,7 @@
1997: of
1996: the
11865: fu
-6499: ##so
+2015: ##s
1030: @
1011: -
1030: @
@@ -9614,10 +9609,10 @@
2142: united
2163: states
1012: .
-20251: sato
+2938: sat
8915: te
10422: ##tsu
-28160: ##taro
+7559: ##tar
1010: ,
1037: a
2887: japanese
@@ -9676,7 +9671,7 @@
2023: this
6463: ratio
1010: ,
-20251: sato
+2938: sat
14833: theo
18425: ##rized
1010: ,
@@ -10020,7 +10015,7 @@
1997: of
1996: the
11865: fu
-6499: ##so
+2015: ##s
2465: class
2020: were
4821: ultimately
@@ -10032,7 +10027,7 @@
2093: three
2062: more
11865: fu
-6499: ##so
+2015: ##s
1030: @
1011: -
1030: @
@@ -10048,7 +10043,7 @@
1010: ,
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1007: )
2020: were
@@ -10089,7 +10084,7 @@
2063: ##e
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2127: until
1996: the
@@ -10115,7 +10110,7 @@
5082: progress
1997: of
11865: fu
-6499: ##so
+2015: ##s
1005: '
1055: s
2810: construction
@@ -10145,7 +10140,7 @@
4757: ##ss
1996: the
11865: fu
-6499: ##so
+2015: ##s
1030: @
1011: -
1030: @
@@ -10251,7 +10246,7 @@
1999: in
1996: the
11865: fu
-6499: ##so
+2015: ##s
2465: class
1998: and
3041: earlier
@@ -10442,7 +10437,7 @@
1999: in
1996: the
11865: fu
-6499: ##so
+2015: ##s
2465: class
1012: .
2023: this
@@ -10514,7 +10509,7 @@
1997: of
1996: the
11865: fu
-6499: ##so
+2015: ##s
2465: class
2008: that
2009: it
@@ -10953,7 +10948,7 @@
1007: )
1006: (
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1998: and
2003: is
@@ -10969,7 +10964,7 @@
27829: kam
26029: ##pon
20996: ro
-2175: go
+1043: g
2300: water
1030: @
1011: -
@@ -11082,7 +11077,7 @@
1007: )
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
14872: exceeded
2008: that
@@ -11224,7 +11219,7 @@
2063: ##e
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2018: had
2093: three
@@ -13815,7 +13810,7 @@
7584: conversion
2138: because
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2018: had
4265: suffered
@@ -13869,7 +13864,8 @@
1012: .
1996: the
11865: fu
-17063: ##sos
+2015: ##s
+2015: ##s
2020: were
5115: scheduled
2000: to
@@ -15034,7 +15030,7 @@
4170: fleet
1012: .
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2018: had
2019: an
@@ -15148,7 +15144,6 @@
4927: 1923
2307: great
26044: kant
-2080: ##o
8372: earthquake
4930: struck
1010: ,
@@ -15206,7 +15201,7 @@
4739: 1931
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1005: '
1055: s
@@ -15269,7 +15264,7 @@
2257: august
4347: 1937
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
10768: fe
22155: ##rrie
@@ -15405,8 +15400,8 @@
1996: the
2422: light
6839: carrier
-7570: ho
-22231: ##sho
+1044: h
+4095: ##sh
2004: as
6802: distant
3104: cover
@@ -15431,7 +15426,7 @@
2063: ##e
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
4066: sort
6340: ##ied
@@ -15503,7 +15498,7 @@
3282: gun
1997: of
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1005: '
1055: s
@@ -15615,7 +15610,7 @@
1030: @
5902: admiral
11895: shi
-3217: ##ro
+2099: ##r
27006: tak
3022: ##as
2226: ##u
@@ -15732,7 +15727,7 @@
3826: 1943
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2012: at
21871: sas
@@ -15797,7 +15792,7 @@
4397: newly
2949: completed
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1996: the
2206: following
@@ -16025,7 +16020,8 @@
5902: admiral
10147: ji
3736: ##sa
-23670: ##buro
+8569: ##bu
+2099: ##r
11472: oz
10830: ##awa
1998: and
@@ -16304,7 +16300,7 @@
1016: 2
1012: .
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2001: was
8217: lightly
@@ -16495,7 +16491,7 @@
2886: attack
1012: .
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2001: was
11551: unsuccessfully
@@ -16583,8 +16579,7 @@
2005: for
25933: ama
4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
1012: .
2043: when
2027: they
@@ -16598,7 +16593,7 @@
4015: transferred
2000: to
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
1998: and
27269: hoisted
@@ -16703,7 +16698,7 @@
27053: indochina
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2150: became
10565: flagship
@@ -16739,7 +16734,6 @@
1996: the
2422: light
10844: cruiser
-1051: o
7677: ##yo
3527: ##do
2006: on
@@ -16870,7 +16864,6 @@
2020: were
13127: escorted
2011: by
-1051: o
7677: ##yo
3527: ##do
1998: and
@@ -16984,7 +16977,7 @@
5388: 58
1998: and
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2001: was
2718: hit
@@ -17126,7 +17119,7 @@
14107: pumping
1012: .
1044: h
-10513: ##yu
+2100: ##y
3654: ##ga
2001: was
1037: a
@@ -43604,7 +43597,6 @@
1996: the
2887: japanese
4290: kong
-2080: ##o
1998: and
7632: hi
7416: ##ei
@@ -47216,7 +47208,7 @@
3212: navy
1012: .
1996: the
-12849: ko
+1047: k
22513: ##tet
6342: ##su
1006: (
@@ -53282,7 +53274,6 @@
2013: from
3306: greek
1174: τ
-29723: ##ε
29728: ##μ
16177: ##ν
29723: ##ε
@@ -53302,7 +53293,6 @@
1998: and
1173: σ
29731: ##π
-29730: ##ο
16177: ##ν
29722: ##δ
29735: ##υ
@@ -100436,7 +100426,6 @@
1999: in
1007: )
1999: in
-1051: o
6590: ##ita
7498: prefecture
1010: ,
@@ -101378,7 +101367,8 @@
2001: was
5409: worst
1999: in
-27603: kochi
+1047: k
+5428: ##chi
1998: and
2000: to
24917: ##kushima
@@ -101796,7 +101786,7 @@
19808: mina
4328: ##mi
21351: ##dai
-3406: ##to
+2102: ##t
1010: ,
15052: okinawa
1012: .
@@ -103992,9 +103982,9 @@
2345: final
21042: landfall
2379: near
-11503: cam
+1039: c
+2213: ##m
6887: ph
-2050: ##a
1010: ,
5148: vietnam
2006: on
@@ -105809,8 +105799,7 @@
1998: and
25933: ama
4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
2006: on
2244: september
2539: 19
@@ -105976,7 +105965,7 @@
1997: of
5292: ha
5428: ##chi
-5558: ##jo
+3501: ##j
1030: @
1011: -
1030: @
@@ -106013,7 +106002,7 @@
5601: mph
1007: )
2012: at
-16480: cho
+10381: ch
6182: ##shi
1010: ,
27368: chiba
@@ -106092,8 +106081,7 @@
1999: in
25933: ama
4328: ##mi
-9808: os
-16369: ##hima
+24772: ##shima
1010: ,
1996: the
4040: storm
@@ -106111,7 +106099,7 @@
2006: on
5292: ha
5428: ##chi
-5558: ##jo
+3501: ##j
1010: ,
3612: wind
26903: gust
@@ -108909,7 +108897,8 @@
1997: of
10101: rainfall
1999: in
-27603: kochi
+1047: k
+5428: ##chi
1010: ,
2096: while
2844: strong
@@ -132935,12 +132924,11 @@
4351: designated
1062: z
2072: ##i
-29731: ##π
2349: due
2000: to
1996: the
1000: "
-1170: π
+100: [UNK]
1000: "
19587: topology
1012: .
@@ -132953,7 +132941,7 @@
1000: "
2030: or
1000: "
-1170: π
+100: [UNK]
1000: "
2930: section
2003: is
@@ -133054,7 +133042,6 @@
3372: ##nt
1062: z
2072: ##i
-29731: ##π
1012: .
2045: there
2024: are
@@ -133716,7 +133703,7 @@
1047: k
1027: =
1015: 1
-1179: ω
+100: [UNK]
1012: .
2023: this
2003: is
@@ -133882,17 +133869,15 @@
2003: is
2170: called
1037: a
-1170: π
+100: [UNK]
2930: section
1012: .
2073: where
1062: z
2072: ##i
-29731: ##π
5344: faces
1062: z
2072: ##i
-29731: ##π
1996: the
2930: section
2061: so
@@ -147284,17 +147269,11 @@
1693: ア
30221: ##イ
30257: ##ラ
-30246: ##フ
-30240: ##ト
30241: ##ナ
30259: ##ル
-30240: ##ト
-30235: ##タ
30237: ##ッ
30228: ##ク
-1702: ク
30259: ##ル
-30232: ##シ
30219: ##ア
1909: 王
1671: の
@@ -147314,10 +147293,10 @@
2226: ##u
11972: guru
26541: ##jia
-1051: o
+100: [UNK]
2053: no
7632: hi
-6806: ##ho
+2232: ##h
1007: )
1010: ,
2003: is
@@ -188846,7 +188825,8 @@
1025: ;
3763: latin
1024: :
-19212: nero
+11265: ne
+2099: ##r
25017: claudius
11604: caesar
11668: augustus
@@ -200556,7 +200536,9 @@
29869: ##र
29879: ##ो
29863: ##न
+100: [UNK]
1317: ग
+100: [UNK]
1000: "
2029: which
2003: is
@@ -200633,7 +200615,11 @@
29836: ##و
29817: ##ت
25573: ##ا
-100: [UNK]
+1282: س
+23673: ##ل
+29836: ##و
+15394: ##د
+29836: ##و
23856: kota
16183: sal
6784: ##ud
@@ -226008,27 +225994,29 @@
9973: pinyin
1024: :
2568: mind
-20391: ##ulu
+2140: ##l
1025: ;
21877: pe
+100: [UNK]
1044: h
1030: @
1011: -
1030: @
-1051: o
2063: ##e
1030: @
1011: -
1030: @
-10147: ji
+1046: j
1024: :
8026: bin
1030: @
1011: -
1030: @
2000: to
+100: [UNK]
1011: -
8840: lo
+100: [UNK]
1007: )
2003: is
1037: a
@@ -233749,7 +233737,8 @@
16107: ##nko
10882: fi
15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
2165: took
2058: over
2004: as
@@ -233792,7 +233781,7 @@
1997: of
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
1998: and
2379: near
22889: sl
@@ -233869,7 +233858,7 @@
21590: ##sko
6819: vi
6460: ##je
-3401: ##ce
+2063: ##e
27885: ob
18053: ##rane
1516: –
@@ -234461,7 +234450,8 @@
16107: ##nko
10882: fi
15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
1010: ,
10655: likewise
1037: a
@@ -234636,7 +234626,8 @@
1010: ,
10882: fi
15000: ##lip
-9142: ##ovic
+4492: ##ov
+2072: ##i
2165: took
2058: over
3094: command
@@ -234694,7 +234685,7 @@
2000: to
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
1010: ,
2073: where
2009: it
@@ -234706,7 +234697,7 @@
2491: control
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
2114: against
1996: the
1046: j
@@ -234719,19 +234710,20 @@
4123: battalion
4110: captured
22827: kan
-21335: ##iza
+2072: ##i
+2050: ##a
10492: barracks
1999: in
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
1012: .
2076: during
4337: combat
1999: in
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
1010: ,
2382: 30
3629: troops
@@ -234744,8 +234736,8 @@
1010: ,
7197: assisted
2011: by
-6735: luck
-2080: ##o
+11320: lu
+3683: ##ko
11867: sp
2226: ##u
1010: ,
@@ -234756,7 +234748,7 @@
2236: general
19817: tr
13006: ##aj
-3401: ##ce
+2063: ##e
1047: k
12096: ##rst
6777: ##ev
@@ -234782,7 +234774,8 @@
7333: deployed
2000: to
2777: met
-14733: ##kovic
+7724: ##kov
+2072: ##i
2006: on
2654: 28
2255: october
@@ -234805,7 +234798,7 @@
2000: to
2175: go
13102: ##sp
-2594: ##ic
+2072: ##i
1010: ,
1037: a
2112: part
@@ -234887,7 +234880,7 @@
14713: ##ija
1058: v
2721: ##la
-19053: ##cic
+2072: ##i
4123: battalion
2241: based
1999: in
@@ -234948,7 +234941,7 @@
21590: ##sko
6819: vi
6460: ##je
-3401: ##ce
+2063: ##e
27885: ob
18053: ##rane
1516: –
@@ -235004,7 +234997,7 @@
1996: the
2181: area
1997: of
-24053: ska
+2912: ##ka
19892: ##br
2078: ##n
3900: ##ja
@@ -235067,20 +235060,20 @@
7221: ban
15333: je
2721: ##la
-19053: ##cic
+2072: ##i
4123: battalion
1010: ,
13523: mat
14713: ##ija
1058: v
2721: ##la
-19053: ##cic
+2072: ##i
4123: battalion
1010: ,
10768: fe
20683: ##rdo
10514: su
-19053: ##cic
+2072: ##i
4123: battalion
1998: and
2112: part
@@ -254648,7 +254641,6 @@
3747: influence
1997: of
2332: king
-1097: æ
10760: ##the
20850: ##lb
19058: ##ald
@@ -257962,7 +257954,6 @@
1999: in
15522: cyrillic
1024: :
-1194: п
2080: ##o
29742: ##д
25529: ##в
@cebtenzzre this looks like a duplicate of this issue? https://github.com/ggerganov/llama.cpp/issues/3502
There is a PR here: https://github.com/ggerganov/llama.cpp/issues/4868. The other PR we are waiting for is being drafted here: https://github.com/ggerganov/llama.cpp/pull/5613#discussion_r1497483512
So, are BERT-based models supported now?
We likely need to move all the tokenization-related code from llama.cpp to a separate file. Otherwise, llama.cpp will become too messy.
@cebtenzzre this looks like a duplicate of these issues? #3502 #4868. Waiting for PR #4868.
Possibly related, but keep in mind that BERT uses an entirely separate tokenizer implementation (wordpiece "WPM") from all other models (SentencePiece "SPM" or GPT-2 "BPE").
Is the SPM preprocessor also replacing accented characters? Seems like we should be able to reuse bits from that. Btw, in case it's useful for folks, I made a little Python function that prints out a color-coded token diff between our results and those from Huggingface (it goes through llama-cpp-python): https://gist.github.com/iamlemec/52eaa4961762efb9c064b871a67f6cc6
The biggest instance I'm finding there is with dash variants like emdash. But basically still a case of replacing certain complex characters with their base forms.
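For anyone who wants a plain-text diff in the same shape as the one at the top of this issue, here is a minimal sketch using only the standard library. It is not the code from the gist: `token_diff` is a hypothetical helper, and the two token lists are assumed to already be available as `(id, token)` pairs (the sample values are taken from the "great kanto earthquake" hunk above).

```python
import difflib

def token_diff(reference, candidate):
    """Return unified-diff lines between two sequences of (id, token) pairs."""
    # Render each token as "id: token", matching the dump format in this issue.
    ref_lines = [f"{tok_id}: {tok}" for tok_id, tok in reference]
    cand_lines = [f"{tok_id}: {tok}" for tok_id, tok in candidate]
    return list(difflib.unified_diff(
        ref_lines, cand_lines,
        fromfile="good_tokens.txt", tofile="lcpp_tokens.txt",
        lineterm="",
    ))

# Sample data copied from the diff above: the reference keeps "##o", ours drops it.
good = [(1996, "the"), (2307, "great"), (26044, "kant"), (2080, "##o"), (8372, "earthquake")]
lcpp = [(1996, "the"), (2307, "great"), (26044, "kant"), (8372, "earthquake")]

for line in token_diff(good, lcpp):
    print(line)
```

Lines starting with `-` are tokens only the reference produced, and lines starting with `+` are tokens only our tokenizer produced.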
A comment regarding this issue from @apage43:
tokenizers bert normalizer's accent stripping is unicode "NFD" normalization, which transforms any accented chars into the "canonical decomposition" (here's where the lookup table comes in - for tokenizers the table comes from here) - the base char + accent codepoint form instead of the single-codepoint form, then just stripping any accent ("non-spacing mark") characters (another table)
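The two steps described above (NFD decomposition, then dropping non-spacing marks) can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the technique, not the tokenizers or llama.cpp code; `strip_accents` is a name chosen here, and Python's built-in Unicode tables stand in for the lookup tables mentioned in the comment.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop non-spacing marks (Unicode category Mn)."""
    # NFD splits a precomposed character into base character + combining marks,
    # e.g. U+014D (o with macron) becomes "o" followed by U+0304.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks, so the base letter survives.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Kantō"))  # Kanto
print(strip_accents("lǜshi"))  # lushi
```

With both steps in place the base letter is kept, which matches the reference tokens in the diff (`kant ##o`, `lush ##i`) rather than the truncated `Kant` and `lshi`.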
That's very helpful @cebtenzzre! Opening a PR with this in a minute.
@cebtenzzre can you take a look at the new PR to see the improvement? https://github.com/ggerganov/llama.cpp/pull/5740