sklearn-crfsuite

API Compatibility with Numpy Arrays and Scipy Matrices for features

Open uwaisiqbal opened this issue 8 years ago • 3 comments

At the moment the library only accepts a list of feature dictionaries, which for our purposes can consume an enormous amount of memory even when using generators. Would it be possible to extend the API to accept numpy arrays or scipy sparse matrices generated from the sklearn DictVectorizer?

uwaisiqbal avatar Jun 28 '17 14:06 uwaisiqbal

@oasis789 crfsuite implements vectorization itself, which is why dicts are currently exposed. I wonder why you prefer DictVectorizer: the sklearn-crfsuite data format is largely compatible, with a few extra features that are useful for sequential models.

It could be possible to implement what you're suggesting using the crfsuite C API (https://github.com/jakevdp/pyCRFsuite did that), but it requires work.

See also: https://github.com/scrapinghub/python-crfsuite/pull/38

kmike avatar Jun 28 '17 17:06 kmike

I wanted to put together a feature-generation pipeline that includes the CRF model and makes use of sklearn feature unions. Feature unions concatenate the outputs of their transformations into sparse matrices, and I wanted to be able to feed these directly to the CRF model within the pipeline.

uwaisiqbal avatar Jun 29 '17 14:06 uwaisiqbal
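
Until something like that is supported, one possible workaround for the pipeline described above is to turn each sparse row back into a feature dict lazily, just before it reaches the CRF. The following is only a rough sketch, not part of sklearn-crfsuite: the helpers `csr_rows_to_dicts` and `split_into_sequences` and the feature names are hypothetical. It assumes the matrix is in CSR form (the `indptr`, `indices`, and `data` arrays that `scipy.sparse.csr_matrix` exposes), that the per-sequence lengths are known, and that float values in feature dicts are acceptable to the model. Plain lists stand in for the scipy arrays so the snippet has no dependencies.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Sequence


def csr_rows_to_dicts(
    indptr: Sequence[int],
    indices: Sequence[int],
    data: Sequence[float],
    feature_names: Sequence[str],
) -> Iterator[Dict[str, float]]:
    """Yield one {feature_name: value} dict per row of a CSR matrix.

    Only the non-zero entries of each row are materialised, so the
    dicts stay as sparse as the matrix itself.
    """
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        yield {feature_names[indices[k]]: float(data[k]) for k in range(start, end)}


def split_into_sequences(
    rows: Iterable[Dict[str, float]], lengths: Sequence[int]
) -> Iterator[List[Dict[str, float]]]:
    """Group a flat stream of per-token dicts into per-sequence lists."""
    rows = iter(rows)
    for n in lengths:
        yield list(islice(rows, n))


# Hypothetical 3x4 matrix, e.g. the output of a feature union:
# [[1.0, 0.0, 0.5, 0.0],
#  [0.0, 0.0, 0.0, 2.0],
#  [0.0, 3.0, 0.0, 0.0]]
indptr = [0, 2, 3, 4]
indices = [0, 2, 3, 1]
data = [1.0, 0.5, 2.0, 3.0]
feature_names = ["is_title", "bias", "suffix_score", "is_digit"]  # made up

# Two sequences: the first has 2 tokens, the second has 1.
X_seqs = list(
    split_into_sequences(
        csr_rows_to_dicts(indptr, indices, data, feature_names), lengths=[2, 1]
    )
)
```

With a real `csr_matrix`, the three arrays would come from its `indptr`, `indices`, and `data` attributes. This keeps the feature union's output sparse in memory and only builds dicts one row at a time, though it does not avoid the conversion cost itself.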

Hi @kmike, are floats used as feature values in dictionaries taken as they are, or do they undergo some transformation? I'm asking because I'm concerned about data sparsity: for example, if I encode my feature in a [-1, 1] range, I wouldn't want the vectorizer to create a separate feature for every possible value.

albertoandreottiATgmail avatar Jul 11 '18 05:07 albertoandreottiATgmail