python-crfsuite

tagger.info() cannot read multi-line matrices

Open Dawny33 opened this issue 6 years ago • 3 comments

I'm not sure whether this is a bug or a feature request, but I'm writing it down because other people are likely to run into it.

When the features of a CRF include NumPy matrices, such as the word2vec vector of a word, tagger.info() cannot parse them: the regex in the dump parser does not match those lines, so no match object is returned. The resulting error looks like this:

Traceback (most recent call last):
  File "test_dumpparser.py", line 8, in <module>
    parser.feed(line.decode('utf8'))
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 62, in feed
    getattr(self, 'parse_%s' % self.state)(line)
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 74, in parse_ATTRIBUTES
    self.result.attributes[m.group(2)] = m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

This is due to the parsing logic in the _dumpparser.py file.
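
To illustrate why a multi-line value trips a line-oriented parser, here is a minimal sketch. The pattern and the dump excerpt are assumptions made up for illustration, not copied from _dumpparser.py; only the failure mode (a match that returns None) corresponds to the traceback above.

```python
import re

# Hypothetical pattern for one "id: name" entry per line; the actual
# pattern used in _dumpparser.py may differ.
ATTRIBUTE_LINE = re.compile(r"\s*(\d+): (.*)")

# Made-up dump excerpt: attribute 1 embeds a 2x2 matrix whose string
# form contains a newline, so it spills onto a second line.
dump = "0: word=hello\n1: vec=[[0.1 0.2]\n [0.3 0.4]]\n2: word=world"


def parse_attributes(text):
    """Return ({name: id}, [lines that did not match the pattern])."""
    attributes, unparsed = {}, []
    for line in text.split("\n"):
        m = ATTRIBUTE_LINE.match(line)
        if m is None:
            # Calling m.group(...) here would raise the AttributeError
            # seen in the traceback; collect the line instead.
            unparsed.append(line)
            continue
        attributes[m.group(2)] = m.group(1)
    return attributes, unparsed


attributes, unparsed = parse_attributes(dump)
print(attributes)
print(unparsed)
```

In this sketch the continuation line of the matrix does not start with an id, so it ends up in `unparsed`, and the attribute name stored for id 1 is truncated at the newline.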

A workaround would be to encode the matrix before passing it to the CRF model as a feature, e.g. base64.b64encode(narray).
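
A minimal sketch of that workaround, using only the standard library. The helper names `encode_vector` and `decode_vector` and the `w2v=` prefix are hypothetical; the vector is a plain list standing in for a word2vec array, and a 2-D matrix would need to be flattened first.

```python
import base64
import struct


def encode_vector(values):
    """Pack floats as little-endian float64 and return a base64 ASCII string."""
    floats = [float(v) for v in values]
    raw = struct.pack("<%dd" % len(floats), *floats)
    # b64encode emits no whitespace or newlines, so the result stays on one line.
    return base64.b64encode(raw).decode("ascii")


def decode_vector(token):
    """Inverse of encode_vector: recover the list of floats."""
    raw = base64.b64decode(token)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


vector = [0.25, -1.5, 3.0]        # stand-in for a word2vec vector
token = encode_vector(vector)
feature = "w2v=" + token          # hypothetical feature name
print(feature)
print(decode_vector(token))
```

Note that a string encoded this way acts as a single opaque feature name, so the model does not see the individual numeric values; the encoding only keeps the dump parseable.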

Dawny33 avatar Oct 09 '17 11:10 Dawny33

I was getting similar errors when my features included \n or \s characters. Try replacing them with a special token, e.g. #NEWLINE or #SPACE.
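
A minimal sketch of that replacement. The helper name `sanitize_feature` and the `word=`/`next=` feature names are hypothetical; it assumes features are built as plain strings before being handed to the trainer or tagger.

```python
import re

# Placeholder tokens as suggested above; any whitespace-free string would do.
NEWLINE_TOKEN = "#NEWLINE"
SPACE_TOKEN = "#SPACE"


def sanitize_feature(value):
    """Replace whitespace in a feature string with placeholder tokens."""
    value = value.replace("\r\n", "\n").replace("\n", NEWLINE_TOKEN)
    # Any remaining whitespace character (space, tab, ...) becomes #SPACE.
    return re.sub(r"\s", SPACE_TOKEN, value)


features = [
    "word=" + sanitize_feature("New\nYork"),
    "next=" + sanitize_feature(" "),
]
print(features)
```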

davidsbatista avatar Oct 18 '17 08:10 davidsbatista

This is still an issue

radostyle avatar Apr 02 '20 03:04 radostyle

Quoting davidsbatista: "I was getting similar errors when I had features that included \n characters or \s, try replacing those by a special token, e.g.: #NEWLINE, or #SPACE"

This is still an issue, and replacing \n or \s characters is not a solution for me. I need to find the original position of each predicted token in the original text, and once I replace \n with a special token in the original text, that is no longer possible.
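
One possible way to reconcile the two points, sketched under assumptions: leave the original text untouched, substitute the placeholder only inside the feature strings, and record each token's character span separately. The tokenizer, the helper names and the `word=` feature are hypothetical and not part of python-crfsuite.

```python
import re


def tokens_with_spans(text):
    """Yield (token, start, end), treating each newline as its own token."""
    for m in re.finditer(r"\n|\S+", text):
        yield m.group(), m.start(), m.end()


def token_feature(token):
    # Only the feature string is rewritten; the text itself is not modified.
    return "word=" + ("#NEWLINE" if token == "\n" else token)


text = "Hello\nbig world"
spans = []
features = []
for token, start, end in tokens_with_spans(text):
    spans.append((start, end))
    features.append([token_feature(token)])

# spans[i] maps the i-th predicted label back to text[start:end].
print(features)
print(spans)
```

Because the spans are computed on the unmodified text, the label predicted for position i can be mapped back through `spans[i]` even though the feature for a newline token contains #NEWLINE.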

aimlnerd avatar May 30 '22 13:05 aimlnerd