[Bug]: index out of range in self

Open lennertvandevelde opened this issue 1 year ago • 3 comments

Describe the bug

The ner-dutch model throws an "index out of range in self" error on some character combinations.

To Reproduce

Example 1


import string

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-dutch")

# Try every two-letter lowercase combination and record the ones that fail.
error_chars = []
for ichar in string.ascii_lowercase:
  for jchar in string.ascii_lowercase:
    try:
      tagger.predict(Sentence(jchar + ichar))
    except IndexError:
      error_chars.append(jchar + ichar)
print(error_chars)

OUTPUT:

['aa', 'ea', 'ia', 'oa', 'qa', 'ra', 'sa', 'ta', 'ua', 'xa', 'ya', 'cb', 'eb', 'fb', 'gb', 'hb', 'ib', 'jb', 'kb', 'lb', 'mb', 'nb', 'pb', 'qb', 'rb', 'sb', 'tb', 'ub', 'xb', 'yb', 'ec', 'fc', 'gc', 'hc', 'ic', 'jc', 'kc', 'lc', 'mc', 'nc', 'oc', 'qc', 'rc', 'tc', 'uc', 'xc', 'yc', 'ad', 'ed', 'fd', 'gd', 'hd', 'id', 'jd', 'kd', 'ld', 'md', 'nd', 'od', 'pd', 'qd', 'rd', 'sd', 'td', 'ud', 'xd', 'yd', 'ae', 'ee', 'fe', 'ie', 'ke', 'ne', 'oe', 'qe', 're', 'se', 'ue', 'xe', 'ye', 'cf', 'ef', 'ff', 'gf', 'hf', 'if', 'jf', 'kf', 'lf', 'mf', 'nf', 'pf', 'qf', 'rf', 'sf', 'tf', 'uf', 'xf', 'yf', 'ag', 'cg', 'eg', 'fg', 'gg', 'hg', 'ig', 'jg', 'kg', 'lg', 'mg', 'ng', 'og', 'pg', 'qg', 'rg', 'sg', 'tg', 'ug', 'xg', 'yg', 'ah', 'ch', 'eh', 'fh', 'gh', 'hh', 'ih', 'jh', 'kh', 'lh', 'mh', 'nh', 'oh', 'ph', 'qh', 'rh', 'sh', 'uh', 'xh', 'yh', 'ai', 'ei', 'ii', 'ji', 'ni', 'oi', 'qi', 'si', 'ti', 'xi', 'yi', 'aj', 'cj', 'ej', 'fj', 'gj', 'hj', 'ij', 'jj', 'kj', 'lj', 'mj', 'nj', 'oj', 'pj', 'qj', 'rj', 'sj', 'tj', 'uj', 'xj', 'yj', 'ak', 'ck', 'ek', 'fk', 'gk', 'hk', 'jk', 'kk', 'lk', 'mk', 'nk', 'ok', 'qk', 'rk', 'sk', 'tk', 'uk', 'xk', 'yk', 'el', 'hl', 'il', 'jl', 'll', 'ml', 'nl', 'ol', 'ql', 'rl', 'sl', 'tl', 'ul', 'xl', 'yl', 'am', 'em', 'fm', 'gm', 'hm', 'im', 'jm', 'lm', 'nm', 'pm', 'qm', 'rm', 'sm', 'tm', 'um', 'xm', 'ym', 'an', 'cn', 'fn', 'gn', 'hn', 'jn', 'kn', 'ln', 'mn', 'nn', 'pn', 'qn', 'rn', 'sn', 'tn', 'un', 'xn', 'yn', 'ao', 'eo', 'io', 'oo', 'qo', 'uo', 'xo', 'yo', 'ap', 'cp', 'ep', 'fp', 'gp', 'hp', 'ip', 'jp', 'kp', 'lp', 'mp', 'np', 'pp', 'qp', 'rp', 'tp', 'xp', 'yp', 'aq', 'cq', 'eq', 'fq', 'gq', 'hq', 'iq', 'jq', 'kq', 'lq', 'mq', 'nq', 'oq', 'pq', 'qq', 'rq', 'sq', 'tq', 'uq', 'xq', 'yq', 'ar', 'hr', 'ir', 'jr', 'lr', 'mr', 'nr', 'or', 'qr', 'rr', 'sr', 'tr', 'ur', 'xr', 'yr', 'as', 'cs', 'es', 'fs', 'gs', 'hs', 'js', 'ks', 'ls', 'ms', 'ns', 'os', 'ps', 'qs', 'rs', 'ss', 'ts', 'us', 'xs', 'ys', 'at', 'ct', 'et', 'ft', 'gt', 'ht', 'it', 'jt', 'kt', 'lt', 
'mt', 'nt', 'ot', 'pt', 'qt', 'rt', 'tt', 'ut', 'xt', 'yt', 'cu', 'eu', 'iu', 'ou', 'tu', 'uu', 'xu', 'yu', 'av', 'ev', 'fv', 'gv', 'hv', 'iv', 'jv', 'kv', 'lv', 'mv', 'nv', 'ov', 'pv', 'qv', 'rv', 'sv', 'uv', 'xv', 'yv', 'aw', 'cw', 'ew', 'fw', 'gw', 'hw', 'iw', 'jw', 'lw', 'mw', 'nw', 'ow', 'pw', 'qw', 'rw', 'sw', 'tw', 'xw', 'yw', 'ax', 'cx', 'fx', 'gx', 'hx', 'ix', 'jx', 'kx', 'lx', 'mx', 'nx', 'ox', 'px', 'qx', 'rx', 'sx', 'tx', 'ux', 'xx', 'yx', 'ay', 'cy', 'ey', 'fy', 'gy', 'iy', 'jy', 'ky', 'ly', 'ny', 'oy', 'qy', 'ry', 'sy', 'ty', 'uy', 'xy', 'yy', 'az', 'cz', 'ez', 'fz', 'gz', 'hz', 'iz', 'jz', 'kz', 'lz', 'mz', 'nz', 'oz', 'pz', 'qz', 'rz', 'sz', 'tz', 'uz', 'xz', 'yz']

Example 2

tagger.predict(Sentence("eedaflegging"))

OUTPUT

IndexError: index out of range in self

Expected behavior

Expected the pipeline to return entities.

Logs and Stack traces

No response

Screenshots

No response

Additional Context

Same Issue as: https://github.com/flairNLP/flair/issues/2813

Environment

Versions:

Flair: 0.12
Pytorch: 1.13.1+cu116
Transformers: 4.26.1
GPU: True

lennertvandevelde, Mar 07 '23 14:03

Hi @lennertvandevelde, it looks like this happens because the vocabulary of the bertje embeddings was updated after the release, see here.
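One way to picture the mismatch: the tokenizer can now emit token ids that have no row in the embedding matrix stored with the released tagger. The following is only a minimal sketch with toy values; `find_out_of_range_tokens` and `toy_vocab` are hypothetical names for illustration and not part of flair or transformers.

```python
from typing import Dict, List


def find_out_of_range_tokens(vocab: Dict[str, int], num_embeddings: int) -> List[str]:
    """Return tokens whose id has no row in an embedding matrix of the given size."""
    return sorted(
        (token for token, token_id in vocab.items() if token_id >= num_embeddings),
        key=vocab.__getitem__,
    )


# Toy stand-in values: a 4-row embedding matrix and a vocabulary that grew to 6 entries.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "de": 2, "het": 3, "eed": 4, "##aflegging": 5}
print(find_out_of_range_tokens(toy_vocab, num_embeddings=4))
# prints ['eed', '##aflegging']
```

With the real model, the size could be read from `tagger.embeddings.model.get_input_embeddings().num_embeddings` and the vocabulary from the Hugging Face tokenizer's `get_vocab()`, assuming the embeddings object exposes its tokenizer; that part is untested here.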

@alanakbik I suppose we should just retrain the model.

In the meantime, you can use a hotfix that adds the embeddings for the tokens that were added afterwards, using the following code:

import torch
from torch import nn

from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-dutch")
embeddings = TransformerWordEmbeddings('GroNLP/bert-base-dutch-cased')

# Input embeddings stored in the released tagger (old, smaller vocabulary).
old_input_embeddings = tagger.embeddings.model.get_input_embeddings()
# Input embedding weights of the current bertje checkpoint (updated vocabulary).
current_weight = embeddings.model.get_input_embeddings().weight

# Keep the tagger's own rows and append the rows of the tokens that were added afterwards.
new_embedding_tensor = torch.cat([old_input_embeddings.weight, current_weight[old_input_embeddings.num_embeddings:-1]])
new_input_embeddings = nn.Embedding.from_pretrained(new_embedding_tensor, freeze=False)
tagger.embeddings.model.set_input_embeddings(new_input_embeddings)
tagger.embeddings.base_model_name = "GroNLP/bert-base-dutch-cased"
tagger.save("ner-dutch-fixed.pt")
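
To check whether a saved model still fails on the reported inputs, something like the helper below could be used. It is only a sketch: `collect_index_errors` and `toy_predict` are hypothetical names, and `toy_predict` merely imitates a tagger that fails on one input so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, Iterable


def collect_index_errors(predict: Callable[[str], object], texts: Iterable[str]) -> Dict[str, str]:
    """Call predict on each text and map every failing text to its IndexError message."""
    failures: Dict[str, str] = {}
    for text in texts:
        try:
            predict(text)
        except IndexError as error:
            failures[text] = str(error)
    return failures


def toy_predict(text: str) -> None:
    # Hypothetical stand-in for a tagger: fails on one known-bad input.
    if text == "eedaflegging":
        raise IndexError("index out of range in self")


print(collect_index_errors(toy_predict, ["huis", "eedaflegging"]))
# prints {'eedaflegging': 'index out of range in self'}
```

For the real check, `predict` could be `lambda text: fixed_tagger.predict(Sentence(text))` with `fixed_tagger = SequenceTagger.load("ner-dutch-fixed.pt")`; an empty result would suggest the hotfix covers the tested inputs.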

helpmefindaname, Mar 13 '23 10:03

#nDD v nevzvvbqlekvl m s das Z umm Vvelvet

Nickyboo1194, May 16 '23 18:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Sep 17 '23 01:09