Character n-grams

Open vboton opened this issue 6 years ago • 2 comments

I have training data where each token is a word, and I've already extracted a few features like NER, POS and CHUNK for each token. But I have a problem when I try to extract character n-gram features. Since these features are computed at the character level, I don't know how to represent their values in the attribute-value format. For example, if the current token is "food", then its character bigram feature will be something like "fo, oo, od". So how do I have to format that feature? By writing something like bigram[0]=fo, oo, od?

vboton • Jul 06 '18 11:07

One way of doing this is to use each bigram as a feature key, with the boolean True as its value:

features['fo'] = True
features['oo'] = True
features['od'] = True

If you also want to consider the position of the bigram within the word, then it would be something like:

features['fo_word_prefix'] = True
features['oo_word_middle'] = True
features['od_word_suffix'] = True

For reference, have a look at features['BOS'] = True in the function word2features in https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb
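The approach above can be generated for any token rather than typed by hand. The following is only a minimal sketch: the function name char_ngram_features and its parameters are made up for illustration and are not part of crfsuite or sklearn-crfsuite. It assumes a per-token feature dict in the style of word2features, and it labels the first n-gram as the prefix, the last as the suffix, and everything in between as middle.

```python
def char_ngram_features(word, n=2, with_position=False):
    """Return one boolean feature per character n-gram of `word`.

    Hypothetical helper for illustration; the key naming follows the
    examples in this thread, not any library convention.
    """
    features = {}
    last = len(word) - n  # start index of the final n-gram
    for i in range(last + 1):  # empty when the word is shorter than n
        gram = word[i:i + n]
        if not with_position:
            features[gram] = True
        elif i == 0:
            features[gram + '_word_prefix'] = True
        elif i == last:
            features[gram + '_word_suffix'] = True
        else:
            features[gram + '_word_middle'] = True
    return features


print(char_ngram_features('food'))
print(char_ngram_features('food', with_position=True))
```

The resulting dict can be merged into the token's other features (POS, CHUNK and so on) with features.update(...). If a bigram could collide with another feature name, prefixing the keys (for example 'bigram=fo') would avoid that.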

kaushikacharya • Jul 06 '18 14:07

Take a look at the Stanford NLP NER features. These features are quite useful in morphologically rich languages like Finnish, Turkish, Russian and others.

You can write the prefixes of the word "food" like:

  • ^f
  • ^fo
  • ^foo

And the suffixes:

  • d$#
  • od$#
  • ood$#

I don't remember the exact start and end flags but you get the idea.
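As a rough sketch of the same idea in code, the helper below builds prefix and suffix features up to a maximum length. The name affix_features is hypothetical, and the '^' and '$' markers are an assumption chosen for illustration; they are not confirmed to match the flags used by Stanford NER.

```python
def affix_features(word, max_len=3, start='^', end='$'):
    """Return boolean prefix/suffix features for `word`.

    Hypothetical helper; `start` and `end` are assumed boundary markers,
    not the exact flags of any particular NER toolkit.
    """
    features = {}
    # Never take an affix longer than the word itself.
    for k in range(1, min(max_len, len(word)) + 1):
        features[start + word[:k]] = True   # prefix of length k
        features[word[-k:] + end] = True    # suffix of length k
    return features


print(sorted(affix_features('food')))
```

Keeping the boundary marker inside the key means a prefix "fo" and a suffix "fo" stay distinct features, so the model can weight them separately.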

usptact • Jul 06 '18 15:07