[Bug]: index out of range in self
Describe the bug
The flair/ner-dutch model throws an "index out of range in self" error on some character combinations.
To Reproduce
Example 1
import flair
from flair.data import Sentence
from flair.models import SequenceTagger
tagger = SequenceTagger.load("flair/ner-dutch")
error_chars = []
for ichar in [chr(i) for i in range(ord('a'), ord('z') + 1)]:
    for jchar in [chr(i) for i in range(ord('a'), ord('z') + 1)]:
        try:
            tagger.predict(Sentence(jchar + ichar))
        except:
            error_chars.append(jchar + ichar)
print(error_chars)
OUTPUT:
['aa', 'ea', 'ia', 'oa', 'qa', 'ra', 'sa', 'ta', 'ua', 'xa', 'ya', 'cb', 'eb', 'fb', 'gb', 'hb', 'ib', 'jb', 'kb', 'lb', 'mb', 'nb', 'pb', 'qb', 'rb', 'sb', 'tb', 'ub', 'xb', 'yb', 'ec', 'fc', 'gc', 'hc', 'ic', 'jc', 'kc', 'lc', 'mc', 'nc', 'oc', 'qc', 'rc', 'tc', 'uc', 'xc', 'yc', 'ad', 'ed', 'fd', 'gd', 'hd', 'id', 'jd', 'kd', 'ld', 'md', 'nd', 'od', 'pd', 'qd', 'rd', 'sd', 'td', 'ud', 'xd', 'yd', 'ae', 'ee', 'fe', 'ie', 'ke', 'ne', 'oe', 'qe', 're', 'se', 'ue', 'xe', 'ye', 'cf', 'ef', 'ff', 'gf', 'hf', 'if', 'jf', 'kf', 'lf', 'mf', 'nf', 'pf', 'qf', 'rf', 'sf', 'tf', 'uf', 'xf', 'yf', 'ag', 'cg', 'eg', 'fg', 'gg', 'hg', 'ig', 'jg', 'kg', 'lg', 'mg', 'ng', 'og', 'pg', 'qg', 'rg', 'sg', 'tg', 'ug', 'xg', 'yg', 'ah', 'ch', 'eh', 'fh', 'gh', 'hh', 'ih', 'jh', 'kh', 'lh', 'mh', 'nh', 'oh', 'ph', 'qh', 'rh', 'sh', 'uh', 'xh', 'yh', 'ai', 'ei', 'ii', 'ji', 'ni', 'oi', 'qi', 'si', 'ti', 'xi', 'yi', 'aj', 'cj', 'ej', 'fj', 'gj', 'hj', 'ij', 'jj', 'kj', 'lj', 'mj', 'nj', 'oj', 'pj', 'qj', 'rj', 'sj', 'tj', 'uj', 'xj', 'yj', 'ak', 'ck', 'ek', 'fk', 'gk', 'hk', 'jk', 'kk', 'lk', 'mk', 'nk', 'ok', 'qk', 'rk', 'sk', 'tk', 'uk', 'xk', 'yk', 'el', 'hl', 'il', 'jl', 'll', 'ml', 'nl', 'ol', 'ql', 'rl', 'sl', 'tl', 'ul', 'xl', 'yl', 'am', 'em', 'fm', 'gm', 'hm', 'im', 'jm', 'lm', 'nm', 'pm', 'qm', 'rm', 'sm', 'tm', 'um', 'xm', 'ym', 'an', 'cn', 'fn', 'gn', 'hn', 'jn', 'kn', 'ln', 'mn', 'nn', 'pn', 'qn', 'rn', 'sn', 'tn', 'un', 'xn', 'yn', 'ao', 'eo', 'io', 'oo', 'qo', 'uo', 'xo', 'yo', 'ap', 'cp', 'ep', 'fp', 'gp', 'hp', 'ip', 'jp', 'kp', 'lp', 'mp', 'np', 'pp', 'qp', 'rp', 'tp', 'xp', 'yp', 'aq', 'cq', 'eq', 'fq', 'gq', 'hq', 'iq', 'jq', 'kq', 'lq', 'mq', 'nq', 'oq', 'pq', 'qq', 'rq', 'sq', 'tq', 'uq', 'xq', 'yq', 'ar', 'hr', 'ir', 'jr', 'lr', 'mr', 'nr', 'or', 'qr', 'rr', 'sr', 'tr', 'ur', 'xr', 'yr', 'as', 'cs', 'es', 'fs', 'gs', 'hs', 'js', 'ks', 'ls', 'ms', 'ns', 'os', 'ps', 'qs', 'rs', 'ss', 'ts', 'us', 'xs', 'ys', 'at', 'ct', 'et', 'ft', 'gt', 'ht', 'it', 'jt', 'kt', 'lt', 'mt', 'nt', 'ot', 'pt', 'qt', 'rt', 'tt', 'ut', 'xt', 'yt', 'cu', 'eu', 'iu', 'ou', 'tu', 'uu', 'xu', 'yu', 'av', 'ev', 'fv', 'gv', 'hv', 'iv', 'jv', 'kv', 'lv', 'mv', 'nv', 'ov', 'pv', 'qv', 'rv', 'sv', 'uv', 'xv', 'yv', 'aw', 'cw', 'ew', 'fw', 'gw', 'hw', 'iw', 'jw', 'lw', 'mw', 'nw', 'ow', 'pw', 'qw', 'rw', 'sw', 'tw', 'xw', 'yw', 'ax', 'cx', 'fx', 'gx', 'hx', 'ix', 'jx', 'kx', 'lx', 'mx', 'nx', 'ox', 'px', 'qx', 'rx', 'sx', 'tx', 'ux', 'xx', 'yx', 'ay', 'cy', 'ey', 'fy', 'gy', 'iy', 'jy', 'ky', 'ly', 'ny', 'oy', 'qy', 'ry', 'sy', 'ty', 'uy', 'xy', 'yy', 'az', 'cz', 'ez', 'fz', 'gz', 'hz', 'iz', 'jz', 'kz', 'lz', 'mz', 'nz', 'oz', 'pz', 'qz', 'rz', 'sz', 'tz', 'uz', 'xz', 'yz']
Example 2
tagger.predict(Sentence("eedaflegging"))
OUTPUT:
IndexError: index out of range in self
Expected behavior
Expected the pipeline to return entities.
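For reference, on inputs that do not trigger the error, entities are attached to the Sentence after prediction; a minimal sketch of the expected usage (the example sentence is illustrative, the actual entities depend on the model):
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-dutch")
sentence = Sentence("George Washington ging naar Washington.")
tagger.predict(sentence)
# Predicted entities are stored on the sentence and can be read back as spans
for entity in sentence.get_spans('ner'):
    print(entity)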
Logs and Stack traces
No response
Screenshots
No response
Additional Context
Same Issue as: https://github.com/flairNLP/flair/issues/2813
Environment
Versions:
Flair: 0.12
Pytorch: 1.13.1+cu116
Transformers: 4.26.1
GPU: True
Hi @lennertvandevelde, it looks like this happens because the vocabulary of the bertje embeddings was updated after release (see here).
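One way to confirm the mismatch is to compare the size of the embedding matrix bundled with the released tagger against the current GroNLP/bert-base-dutch-cased checkpoint; a minimal sketch, assuming the tagger exposes its transformer model as in the hotfix below:
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings

tagger = SequenceTagger.load("flair/ner-dutch")
embeddings = TransformerWordEmbeddings('GroNLP/bert-base-dutch-cased')

# Number of rows in the embedding matrix shipped with the released tagger
old_size = tagger.embeddings.model.get_input_embeddings().num_embeddings
# Number of rows in the current base-model checkpoint
new_size = embeddings.model.get_input_embeddings().num_embeddings
# If new_size > old_size, the tokenizer can emit ids that are out of range for the tagger
print(old_size, new_size)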
@alanakbik I suppose we should just retrain the model.
In the meantime you can use a hotfix that adds embeddings for the tokens that were added afterwards, using the following code:
from flair.models import SequenceTagger
from flair.embeddings import TransformerWordEmbeddings
from torch import nn
import torch
tagger = SequenceTagger.load("flair/ner-dutch")
embeddings = TransformerWordEmbeddings('GroNLP/bert-base-dutch-cased')
# Take the embedding matrix shipped with the released tagger and append the rows
# that were added to the base model's vocabulary after the tagger was trained
old_embeddings = tagger.embeddings.model.get_input_embeddings()
new_embeddings = embeddings.model.get_input_embeddings()
new_embedding_tensor = torch.cat(
    [old_embeddings.weight, new_embeddings.weight[old_embeddings.num_embeddings:-1]]
)
new_input_embeddings = nn.Embedding.from_pretrained(new_embedding_tensor, freeze=False)
tagger.embeddings.model.set_input_embeddings(new_input_embeddings)
tagger.embeddings.base_model_name="GroNLP/bert-base-dutch-cased"
tagger.save("ner-dutch-fixed.pt")
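After saving, the patched tagger can be reloaded and checked against one of the failing inputs; a minimal sketch, assuming ner-dutch-fixed.pt was written to the working directory:
from flair.data import Sentence
from flair.models import SequenceTagger

# Reload the locally patched model and re-run a previously failing input
fixed_tagger = SequenceTagger.load("ner-dutch-fixed.pt")
sentence = Sentence("eedaflegging")
fixed_tagger.predict(sentence)  # should no longer raise IndexError
print(sentence.get_spans('ner'))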
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.