python-crfsuite
Add support for word-embedding-like features that are lists of floats
The current API doesn't support adding features that are lists of floats, e.g. word embeddings. The current way to add these features is to write something like {"f0": 1.5, "f1": 1.6, "f2": -1.4} for a 3-dimensional embedding, which puts an extra burden on the user.
I propose a wrapper feature that allows users to pass the word embedding list as a value in the feature dictionary, e.g. {"f": FloatFeatures([1.5, 1.6, -1.4])}. Internally, this would convert the float features into a representation consistent with the CRFSuite ItemSequence, using a consistent naming convention like "f:0", "f:1", "f:2".
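To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like. It is not part of python-crfsuite: FloatFeatures only exists as a proposal in this issue, and the helper name expand_float_features is hypothetical. The sketch is pure Python and only produces the flat {"f:0": ..., "f:1": ...} dictionary described above, which could then be passed on as an ordinary feature dict.

```python
from typing import Dict, Sequence


class FloatFeatures:
    """Hypothetical wrapper marking a list of floats as one vector-valued feature."""

    def __init__(self, values: Sequence[float]):
        self.values = [float(v) for v in values]


def expand_float_features(features: Dict[str, object]) -> Dict[str, object]:
    """Flatten FloatFeatures values into 'name:index' -> float entries.

    Hypothetical helper: all other values are copied through unchanged.
    """
    flat: Dict[str, object] = {}
    for name, value in features.items():
        if isinstance(value, FloatFeatures):
            # One real-valued feature per embedding dimension.
            for i, v in enumerate(value.values):
                flat[f"{name}:{i}"] = v
        else:
            flat[name] = value
    return flat


item = {"bias": 1.0, "f": FloatFeatures([1.5, 1.6, -1.4])}
print(expand_float_features(item))
```

Doing the expansion in one place keeps the naming convention consistent across all items, which is the main thing the manual approach leaves to the user.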
@kmike and @tpeng do you want to have a look at it?
Using word embeddings improves accuracy a lot. Having a supported way to include them in python-crfsuite would be wonderful.
@napsternxg any updates on feeding float vectors as features? I have the same situation where I want to use GloVe embeddings for a NER task using a CRF.
@muhnash0 I basically did the proposed approach in my comment manually. It was quite easy.
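For anyone else landing here, the manual version can be sketched roughly as follows. This is an illustration, not the code from any particular project: the function names, the "emb:i" key format, and the toy embedding table are all assumptions, and a real setup would load GloVe vectors into the lookup dict instead.

```python
from typing import Dict, List, Sequence


def token_features(token: str, embeddings: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Build a flat feature dict for one token (hypothetical helper)."""
    features: Dict[str, float] = {"bias": 1.0, "lower=" + token.lower(): 1.0}
    vector = embeddings.get(token.lower())
    if vector is not None:
        # One float-valued feature per embedding dimension.
        for i, v in enumerate(vector):
            features[f"emb:{i}"] = float(v)
    # Out-of-vocabulary tokens simply get no embedding features.
    return features


def sentence_features(tokens: List[str], embeddings: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Feature dicts for a whole sentence, one per token."""
    return [token_features(t, embeddings) for t in tokens]


# Toy 3-dimensional vectors standing in for real pretrained embeddings.
toy_embeddings = {"paris": [0.1, -0.2, 0.3]}
for feats in sentence_features(["Paris", "rocks"], toy_embeddings):
    print(feats)
```

The resulting list of dicts has the same shape as any other per-token feature list, so the embedding dimensions just sit alongside the usual hand-crafted features.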
I don't think the proposed approach will work. CRFsuite does not support continuous features, so each unique key/value combination will be treated as a unique feature. You have to discretize the continuous features with a technique like https://arxiv.org/abs/1711.01068
@DomHudson crfsuite does support continuous features
The approach I suggested is utilized in this tool I have built.
https://github.com/napsternxg/TwitterNER