python-crfsuite
python-crfsuite copied to clipboard
Add float list features in ItemSequence
- New class FloatFeatures defined in
pycrfsuite._float_features - Useful for adding word embedding features
- Wrap the word embedding list of float values with FloatFeatures class
- The api supports the existing extraction of features from nested dicts.
- Added test case in
tests/test_itemsequence.py
Example usage:
import pycrfsuite
from pycrfsuite._float_features import FloatFeatures as FF
seq = pycrfsuite.ItemSequence([
{"w2v": FF([1., 2., 3.])},
{"w2v": FF([-1., 5, 4.])}
])
Fixes #37 Closes #39
Thanks for taking look at it!
I think it'd be nice to support numpy arrays and/or stdlib arrays instead of creating a custom wrapper, because this is a format you usually get embeddings in in a first place. If user has a list of floats it is easy to convert to array.array if needed.
numpy support is a bit more problematic because it may require a numpy dependency, which I'd like to avoid in python-crfsuite, but maybe it is possible to implement it somehow using memoryviews.
Also, could you please add generated cpp files in a separate commit, or just remove them? Travis tests should pass even without generated cpp files because they are generated on travis before each test run.
@kmike I was also thinking of using numpy array but decided against it once I realized numpy was not a dependency in crfsuite. I don't think using numpy array will be beneficial as everything eventually gets mapped to the ItemSequence object using the crfsuite_api.
My rationale behind using the custom wrapper was to allow for the only other possible feature which CRFSuite can take and that would be list of floats. If there is any other type of list then it should always be incompatible. This explicit mapping will allow for word vectors and other vector embedding (say thought vectors from encoder-decoder) to be used as a native type in CRFSuite. I might be wrong here but this is what I could understand from my brief overview of python-CRFSuite code.
Array.array sounds like a good idea (I have never used these), but I am not sure if that would work as the user will have to eventually wrap their arrays into array.array.
I didn't know about not generating CPP files. I think if that was the case then the CPP files should be in .gitignore, what do you think ?