python-crfsuite
tagger.info() cannot read multi-line matrices
Not sure whether this is a bug or a feature request, but I am writing it down because other people will likely run into it.
When the features of a CRF also contain numpy matrices, such as the word2vec vector of a word, tagger.info() cannot parse them: the multi-line text of the matrix does not match the parser's regex, so no match group is available. The resulting error looks like this:
Traceback (most recent call last):
  File "test_dumpparser.py", line 8, in <module>
    parser.feed(line.decode('utf8'))
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 62, in feed
    getattr(self, 'parse_%s' % self.state)(line)
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 74, in parse_ATTRIBUTES
    self.result.attributes[m.group(2)] = m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
This is due to the parsing logic in the _dumpparser.py file.
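To illustrate the failure mode, here is a minimal sketch of a line-by-line parser that expects one "id: name" entry per line. The pattern ENTRY and the helper parse_entry are hypothetical stand-ins, not copied from _dumpparser.py; they only show how a feature name containing a newline leaves a continuation line that matches nothing, so the match object is None.

```python
import re

# Hypothetical one-entry-per-line pattern (an assumption, not the library's actual regex).
ENTRY = re.compile(r"\s*(\d+):\s*(.+)$")

def parse_entry(line):
    """Return (name, id) for a dump line, or None when the line does not match."""
    m = ENTRY.match(line)
    if m is None:
        # Calling m.group(...) here would raise the AttributeError from the traceback.
        return None
    return m.group(2), m.group(1)

# A feature name that embeds a printed two-row matrix, and therefore a newline.
feature = "w2v=[[0.1 0.2]\n [0.3 0.4]]"
dump = "     0: " + feature

# The dump is read line by line, so the second matrix row arrives on its own.
results = [parse_entry(line) for line in dump.split("\n")]
```

The first line parses, while the second (the continuation of the matrix) yields None, which is where an unguarded m.group(...) call would fail.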
A workaround is to encode the matrix and then pass it into the CRF model as a feature (base64.b64encode(narray)).
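A minimal sketch of that encoding step, assuming the goal is simply a feature string with no whitespace or newlines. The helpers encode_vector and decode_vector and the "w2v=" prefix are hypothetical names; the standard-library array module stands in for numpy here, and with a numpy array the raw bytes could come from arr.tobytes() instead.

```python
import base64
from array import array

def encode_vector(values):
    """Pack floats as raw doubles and return a single-line base64 string."""
    raw = array("d", values).tobytes()
    return base64.b64encode(raw).decode("ascii")

def decode_vector(text):
    """Inverse of encode_vector: recover the list of floats."""
    out = array("d")
    out.frombytes(base64.b64decode(text))
    return list(out)

vec = [0.25, -1.5, 3.0]
token = encode_vector(vec)

# Hypothetical feature name; the base64 payload contains no spaces or newlines.
feature = "w2v=" + token
```

Because base64.b64encode does not insert line breaks, the resulting feature name stays on one line when the model is dumped.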
I was getting similar errors when I had features that included \n characters or \s. Try replacing those with a special token, e.g. #NEWLINE or #SPACE.
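A rough sketch of that replacement, assuming it is applied only when building the feature strings. The helpers sanitize and token_features and the "word=" prefix are hypothetical, chosen for illustration.

```python
import re

def sanitize(value):
    """Replace newlines and other whitespace with placeholder tokens."""
    value = value.replace("\n", "#NEWLINE")
    return re.sub(r"\s", "#SPACE", value)

def token_features(token):
    """Build the feature list for one token (hypothetical feature set)."""
    return ["word=" + sanitize(token)]

tokens = ["Hello", "\n", "big world"]

# Only the feature strings are rewritten; the token list itself is left as-is.
features = [token_features(t) for t in tokens]
```

In this sketch features[i] still lines up with tokens[i], since the substitution touches the feature strings rather than the input text.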
This is still an issue, and replacing \n or \s characters is not a solution, since I need to find the position of each predicted token in the original text. When I replace \n with a special token in the original text, this is no longer possible.