
Greedy or not?

Open rlvoyer opened this issue 11 years ago • 3 comments

Hi there! This isn't an issue, so much as a question.

I came across your neat POS tagger implementation by way of this blog post:

https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

I'm curious... In the post, you describe the greedy implementation and argue that it's plenty accurate, but the implementation here in your repo actually uses a beam search. Do you have accuracy numbers for this implementation?

rlvoyer avatar Dec 09 '14 06:12 rlvoyer

I actually implemented this beam tagger before I wrote that blog post. This one happens to be more accurate than the one in the blog post, mostly because it uses Brown cluster features and case frequency features. It's possible to make a greedy tagger perform about as well on English. Maybe there are other languages where the beam makes more of a difference.

I use this tagger with the parser for academic reasons. When I'm publishing about the parser, I prefer to have as many of the details as possible match previous systems, which tend to use beam-search taggers.

syllog1sm avatar Dec 13 '14 16:12 syllog1sm

Makes sense -- thanks! Do you have accuracy numbers? Also, do you have any documentation/tips on how to play around with the feature set? I can poke around in the code, but if you have any suggestions, that'd be great. Thanks.

rlvoyer avatar Dec 17 '14 03:12 rlvoyer

re: Accuracy numbers

Accuracy for taggers tops out at about 97.1% on WSJ sections 19-21 and 97.3% on sections 22-24. You get more movement on out-of-domain data. See this paper for a thoughtful look at what it might take to get us above this peak:

http://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf

This is really the one paper I recommend people read on POS tagging.

re: Features

I've set this up in a way that I find very easy; let me know how you go with it.

Background: since we're using a linear model, there are two parts to feature extraction:

  1. Extract a set of atomic boolean values, which we think are a good way to represent the context;

  2. Create conjunction features that ask about the state of multiple atomic values.

In _tagger_features.pyx, we have a big enum naming the atomic values, and a function fill_context that extracts all the atomic features into one big array. The enum will have an entry like N1w, i.e. the word-form of the next token, which we use as the index of that value in the array. So, we set context[N1w] = tokens[i+1].word, etc.

The actual feature-templates are defined as tuples of these indices. So, to add that feature, we define a tuple like (N1w,), and we can also add another feature (N1w, N1p), which is the conjunction of the next word and its POS tag. The template definitions are given to thinc.features.Extractor on initialization, so that the templates can be extracted efficiently.
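To make the two steps concrete, here's a rough pure-Python sketch of the idea. The real code is Cython and lives in _tagger_features.pyx/thinc; the names N1w, N1p, fill_context, and the token representation here are illustrative stand-ins, not the actual implementation:

```python
# Hypothetical pure-Python sketch of the context-array + template scheme.
# Atomic value slots (stand-ins for the Cython enum entries):
N1w = 0   # word-form of the next token
N1p = 1   # POS tag of the next token
CONTEXT_SIZE = 2

def fill_context(context, tokens, i):
    """Fill the flat context array with atomic values for position i."""
    # tokens is assumed to be a list of (word, tag) pairs for this sketch
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ("<EOL>", "<EOL>")
    context[N1w] = nxt[0]
    context[N1p] = nxt[1]

# Feature templates: tuples of indices into the context array
templates = [
    (N1w,),        # next word alone
    (N1w, N1p),    # conjunction of next word and its POS tag
]

def extract(context, templates):
    # Each template yields one feature: the tuple of its slot values
    return [tuple(context[idx] for idx in t) for t in templates]

tokens = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
context = [None] * CONTEXT_SIZE
fill_context(context, tokens, 0)
print(extract(context, templates))  # [('cat',), ('cat', 'NN')]
```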

thinc.features.Extractor.get_feats is given the context array, and returns an array of Feature structs, which contain the hashes of the template-values. This is then given to thinc.learner.LinearModel.get_scores.
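Conceptually, that step looks something like the sketch below. The Feature field names and the use of Python's built-in hash are illustrative; the real thinc structs and hashing are implemented in C/Cython:

```python
# Hedged sketch of what get_feats conceptually does: hash each
# template's extracted values into one key, tagged with the template's
# slot index. Not the real thinc struct layout.
from collections import namedtuple

Feature = namedtuple("Feature", ["key", "slot", "value"])

def get_feats(context, templates):
    feats = []
    for slot, template in enumerate(templates):
        values = tuple(context[idx] for idx in template)
        # Include the slot so identical values in different templates
        # hash to different keys. Real code uses a fast C hash.
        key = hash((slot,) + values)
        feats.append(Feature(key=key, slot=slot, value=1))
    return feats

context = ["cat", "NN"]          # toy context array
templates = [(0,), (0, 1)]       # e.g. (N1w,) and (N1w, N1p)
feats = get_feats(context, templates)
```

The linear model then scores each tag by summing weights[feature.key] * feature.value over these features.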

So:

  1. To add new atomic properties, extend the enum of feature names, and fill the resulting slot in fill_context.

  2. To actually add the feature, define a tuple of the values you want in your feature-template, and ensure the template is passed to thinc.features.Extractor on initialization.

  3. If you want to ignore this machinery, and do something ad hoc yourself, you can create your own Feature* array, possibly copying values in from the one created by thinc.features.Extractor. Each Feature struct requires a key, an index for its template "slot", and a value. The slots allow you to provide multiple values for a single template; like, you could add multiple values of your (N1w, N1p) template feature, if you wanted to allow multiple POS tags for the word.
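Point 3's "multiple values per slot" idea could look roughly like this; the function name, field names, and the probability-weighted values are hypothetical, just to show several Feature entries sharing one template slot:

```python
# Illustrative sketch: several features in the same template slot,
# e.g. allowing multiple candidate POS tags for the next word.
from collections import namedtuple

Feature = namedtuple("Feature", ["key", "slot", "value"])

def ambiguous_tag_features(next_word, tag_probs, slot):
    """One feature per candidate tag, weighted by its probability."""
    return [
        Feature(key=hash((slot, next_word, tag)), slot=slot, value=p)
        for tag, p in sorted(tag_probs.items())
    ]

# "run" could be a noun or a verb; emit both, sharing slot 7
feats = ambiguous_tag_features("run", {"NN": 0.4, "VB": 0.6}, slot=7)
```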

syllog1sm avatar Dec 17 '14 11:12 syllog1sm