pos-chunker
Consecutive Chunks not matched
The chunker doesn't seem to match consecutive chunks in this case. Note that the tag string contains double spaces to preserve the original whitespace.
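// Assumes the module is installed under its npm name.
var chunker = require('pos-chunker');

// Double spaces mark the places where the original text had whitespace.
var tags = "It's/NN  on/IN  march/NN  5/CD th/DT ./. ";

// A cardinal number followed by an ordinal suffix, e.g. "5 th".
var ranks = {
  ruleType: 'tokens',
  pattern: '[ { tag:CD } ] [ { word:/th|st|nd/ } ]',
  result: 'RANK'
};

// A month name, capitalised or not, tagged NN or NNP.
var months = {
  ruleType: 'tokens',
  pattern: '[ { word:/[Jj]anuary|[Ff]ebruary|[Mm]arch|[Aa]pril|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ugust|[Ss]eptember|[Oo]ctober|[Nn]ovember|[Dd]ecember/; tag:NNP? } ]',
  result: 'MONTH'
};

// A MONTH chunk immediately followed by a RANK chunk.
var dates = {
  ruleType: 'tokens',
  pattern: '[ { chunk:"MONTH" } ] [ { chunk:"RANK" } ]',
  result: 'DATE'
};

// Rules are applied in order, so MONTH and RANK exist before DATE is tried.
var chunks = chunker.chunk(
  tags,
  [months, ranks, dates]
);
console.log(chunks);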
Output:
It's/NN on/IN [MONTH march/NN] [RANK 5/CD th/DT] ./.
It captured the two chunks MONTH and RANK, but not the consecutive pattern defined by dates.
Interesting...
From what I know about POS taggers, they only leave one space between tokens. I've only used NLTK's tagger in earnest, so I'm basing a lot of my assumptions on that. Having said that, I did a lot of reading around while writing this chunker, and I didn't come across any tagger that preserved whitespace in this way.
So do you have a POS tagger that is preserving the spaces?
Or are you generating strings in some other way?
I guess it wouldn't hurt to allow multiple spaces, though. I think most of my expressions could be refined to do that. Let me know whether this really is a requirement, and if so, we can investigate further.
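To illustrate what "allowing multiple spaces" could mean, here is a minimal sketch. It is not pos-chunker's actual internals: `sequencePattern` is a hypothetical helper that joins per-token patterns with `' +'` (one or more spaces) instead of a single literal space, so the same expression matches both spacings.

```javascript
// Hypothetical helper, NOT pos-chunker's real implementation: builds a RegExp
// matching a sequence of word/TAG tokens separated by one or more spaces.
function sequencePattern(tokenPatterns) {
  // Join with ' +' rather than a single literal space.
  return new RegExp(tokenPatterns.join(' +'), 'g');
}

var cd = '\\S+/CD';                 // any token tagged CD
var suffix = '(?:th|st|nd)/\\S+';   // ordinal suffix with any tag
var rank = sequencePattern([cd, suffix]);

var single = 'march/NN 5/CD th/DT ./.';
var doubled = 'march/NN  5/CD  th/DT ./.';

console.log(single.match(rank));
console.log(doubled.match(rank));
```

Both calls find the number-plus-suffix pair; the match on the second string keeps its double space, so no whitespace information is lost.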
Which POS tagger are you using? I've just noticed that you have It's as one word. Is your tagger leaving It's like that, or have you manually created some strings to work with? I think it should be split into three components, with the apostrophe separated out:
It/PRP '/" s/PRP on/IN march/NNP 5/CD th/DT ./.
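A rough sketch of a tokenizer that produces that split might look like the following. The `tokenize` function is hypothetical and is not the tokenizer shipped with pos-js; it simply treats runs of letters, runs of digits, and single symbols as separate tokens.

```javascript
// Hypothetical tokenizer sketch; not the tokenizer that ships with pos-js.
function tokenize(text) {
  // A token is a run of ASCII letters, a run of digits,
  // or any single character that is neither whitespace, letter, nor digit.
  return text.match(/[A-Za-z]+|\d+|[^\sA-Za-z\d]/g) || [];
}

var tokens = tokenize("It's on march 5th.");
console.log(tokens);
```

This gives eight tokens for the example sentence, with the apostrophe on its own and "5th" split into "5" and "th". It only handles ASCII letters, so real text would need a broader character class.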
It should be split into 3, as you said. I wrote my own tokenizer, which missed that, and then fed its output into a modified version of pos-js for tagging (the only modification was to .trim() tokens before tagging them).
My issue with the current tokenizer and taggers is that whitespace from the original conversation is totally lost. I want to preserve the spaces while recognizing the chunks.
E.g. in these lines:
var string = "It's on march 5th.";
var tags = "It's/NN  on/IN  march/NN  5/CD th/DT ./. ";
In my setup, one space always separates tags, and one more space is added wherever the original text had whitespace. So here there are 2 spaces between "on" and "march" but only 1 space between "5" and "th".
What I'm trying to do is tag custom chunks so that they can be displayed differently without destroying whitespace, so support for multiple spaces is needed.
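The spacing scheme described above can be sketched as a pair of helpers. Both `encode` and `decode` are hypothetical names, and the `spaceAfter` flag is an assumed token property recording whether the original text had whitespace after that token.

```javascript
// Hypothetical helpers illustrating the spacing scheme; not part of any library.
// Each token is { word, tag, spaceAfter }.
function encode(tokens) {
  return tokens.map(function (t) {
    // Always one separator space, plus one more for original whitespace.
    return t.word + '/' + t.tag + ' ' + (t.spaceAfter ? ' ' : '');
  }).join('');
}

function decode(tagged) {
  // Keep the word, drop the tag and the separator, keep any extra space.
  return tagged.replace(/(\S+)\/\S+ ( ?)/g, '$1$2');
}

var example = [
  { word: "It's", tag: 'NN', spaceAfter: true },
  { word: 'on', tag: 'IN', spaceAfter: true },
  { word: 'march', tag: 'NN', spaceAfter: true },
  { word: '5', tag: 'CD', spaceAfter: false },
  { word: 'th', tag: 'DT', spaceAfter: false },
  { word: '.', tag: '.', spaceAfter: false }
];

var encoded = encode(example);
console.log(JSON.stringify(encoded));
console.log(decode(encoded));
```

Decoding the encoded string gives back the original sentence. Note that `decode` only handles plain tagged strings; chunker output containing brackets such as `[MONTH ...]` would need extra handling.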
Ok...sounds like it's worth addressing, although I'm wondering if it should be addressed at the tokenising stage. A quick Google around shows that the Stanford tokeniser, for example, includes options to preserve the whitespace:
http://nlp.stanford.edu/software/tokenizer.shtml
On a few occasions whilst working on this chunker and some related modules, I've felt that pos-js doesn't do what I want, and resolving the issues might be more trouble than just writing a fresh tokeniser. I'm going to put a bit of time over the next few days into working out whether the answer is indeed to create a new tokeniser, or whether to solve your problem in this module.
Either way, we'll sort something out.
And by the way, thanks for your interest!