
Consecutive Chunks not matched

Open gmonaie opened this issue 9 years ago • 5 comments

The chunker doesn't seem to match consecutive chunks in this case. The tag string contains double spaces in order to preserve the original whitespace.

var chunker = require('pos-chunker');

var tags = "It's/NN  on/IN  march/NN  5/CD th/DT ./. ";

var ranks = {
  ruleType: 'tokens',
  pattern: '[ { tag:CD } ] [ { word:/th|st|nd/ } ]',
  result: 'RANK'
};

var months = {
  ruleType: 'tokens',
  pattern: '[ { word:/[Jj]anuary|[Ff]ebruary|[Mm]arch|[Aa]pril|[Mm]ay|[Jj]une|[Jj]uly|[Aa]ugust|[Ss]eptember|[Oo]ctober|[Nn]ovember|[Dd]ecember/; tag:NNP? } ]',
  result: 'MONTH'
};

var dates = {
  ruleType: 'tokens',
  pattern: '[ { chunk:"MONTH" } ] [ { chunk:"RANK" } ]',
  result: 'DATE'
};

var chunks = chunker.chunk(
  tags,
  [months, ranks, dates]
);

console.log(chunks);

Output:

It's/NN  on/IN  [MONTH march/NN]  [RANK 5/CD th/DT] ./. 

It captured the two chunks MONTH and RANK, but not the consecutive pattern defined by dates.

gmonaie, Mar 09 '15 18:03

Interesting...

From what I know about POS taggers, they only leave one space between tokens. I've only used NLTK's tagger in earnest, so I'm basing a lot of my assumptions on that. Having said that, I did a lot of reading around while writing this chunker, and I didn't come across any taggers that preserved whitespace in this way.

So do you have a POS tagger that is preserving the spaces?

Or are you generating strings in some other way?

I guess it wouldn't hurt to allow multiple spaces, though. I think most of my expressions could be refined to do that. Let me know whether this really is a requirement, and if so, we can investigate further.
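As a rough illustration of what "allowing multiple spaces" could mean, here is a minimal sketch using the output string from the report. The two regular expressions are hypothetical and are not the expressions pos-chunker uses internally; they only show why a single literal space between two chunks fails on this input, while ` +` succeeds. The nested `[DATE ...]` form is likewise just an assumption made for the example.

```javascript
// Hypothetical patterns for illustration only; not pos-chunker's internals.
var output = "It's/NN  on/IN  [MONTH march/NN]  [RANK 5/CD th/DT] ./. ";

// Expects exactly one space between the MONTH chunk and the RANK chunk.
var strict = /\[MONTH [^\]]+\] \[RANK [^\]]+\]/;

// Accepts one or more spaces between the two chunks.
var relaxed = /\[MONTH [^\]]+\] +\[RANK [^\]]+\]/;

var strictMatches = strict.test(output);   // false: there are two spaces
var relaxedMatches = relaxed.test(output); // true

// Wrap the matched span, keeping the original spacing inside it.
var wrapped = output.replace(relaxed, '[DATE $&]');

console.log(strictMatches, relaxedMatches);
console.log(wrapped);
```

Because the replacement reuses the matched text (`$&`), the double space between the two chunks survives inside the wrapped span.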

markbirbeck, Mar 09 '15 18:03

Which POS tagger are you using? I've just noticed that you have It's as one word. Is your tagger leaving It's like that, or have you manually created some strings to work with? I think it should be split into three components, with the apostrophe separated out:

It/PRP '/" s/PRP on/IN march/NNP 5/CD th/DT ./.

markbirbeck, Mar 09 '15 18:03

It should be split into 3, as you said. I wrote my own tokenizer, which missed that, and then fed its output into a modified version of pos-js for tagging (the only modification was to .trim() tokens before tagging them).

My issue with the current tokenizer and taggers is that whitespace from the original conversation is lost entirely. I want to preserve the spaces while recognizing the chunks.

E.g. in this line:

var string = "It's on march 5th.";
var tags = "It's/NN  on/IN  march/NN  5/CD th/DT ./. ";

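In my setup, 1 space separates tags by default, and the original whitespace is kept in addition to it. So here there are 2 spaces between "on" and "march" (1 separator plus 1 original space) but only 1 space between "5" and "th" (the separator alone, since "5th" contained no whitespace).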
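A minimal sketch of that construction, for illustration only: `toTagString` and the `{ word, tag, space }` entry shape are hypothetical names, not part of pos-js or pos-chunker. Each entry carries the whitespace that followed the token in the original string, and one separator space is added after every tagged token.

```javascript
// Hypothetical helper: builds a tag string that keeps the original whitespace.
// entry.space is the whitespace that followed the token in the source text.
function toTagString(entries) {
  return entries.map(function (entry) {
    // token/tag, then the original whitespace, then one separator space
    return entry.word + '/' + entry.tag + entry.space + ' ';
  }).join('');
}

var entries = [
  { word: "It's", tag: 'NN', space: ' ' },
  { word: 'on', tag: 'IN', space: ' ' },
  { word: 'march', tag: 'NN', space: ' ' },
  { word: '5', tag: 'CD', space: '' },
  { word: 'th', tag: 'DT', space: '' },
  { word: '.', tag: '.', space: '' }
];

var tagString = toTagString(entries);
console.log(JSON.stringify(tagString));
```

With these entries the result is the same string as `tags` above, including the trailing space.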

gmonaie, Mar 09 '15 18:03

What I'm trying to do is tag custom chunks so that they can be displayed differently without destroying whitespace, so support for multiple spaces is needed.

gmonaie, Mar 09 '15 18:03

Ok...sounds like it's worth addressing, although I'm wondering if it should be addressed at the tokenising stage. A quick Google around shows that the Stanford tokeniser, for example, includes options to preserve the whitespace:

http://nlp.stanford.edu/software/tokenizer.shtml

On a few occasions whilst working on this chunker and some related modules, I've felt that pos-js doesn't do what I want, and resolving the issues might be more trouble than just writing a fresh tokeniser. I'm going to put a bit of time over the next few days into working out whether the answer is indeed to create a new tokeniser, or whether to solve your problem in this module.
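To give an idea of the tokeniser route, here is a minimal sketch of a tokeniser that records the whitespace following each token. It is not pos-js, not the Stanford tokeniser, and not a committed design; `tokenizeKeepingWhitespace` is a hypothetical name, the splitting rules (letters, digits, single punctuation characters) are deliberately crude, and it assumes the text does not start with whitespace.

```javascript
// Hypothetical sketch: splits text into runs of letters, runs of digits, or
// single punctuation characters, and keeps the whitespace after each token.
function tokenizeKeepingWhitespace(text) {
  var pattern = /([A-Za-z]+|\d+|[^\sA-Za-z\d])(\s*)/g;
  var tokens = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({ word: match[1], space: match[2] });
  }
  return tokens;
}

var tokens = tokenizeKeepingWhitespace("It's on march 5th.");

// The words alone: "It's" becomes three tokens, "5th" becomes two.
var words = tokens.map(function (t) { return t.word; });

// Joining word + whitespace reproduces the original string.
var rebuilt = tokens.map(function (t) { return t.word + t.space; }).join('');

console.log(words);
console.log(rebuilt);
```

Since every token keeps its trailing whitespace, a later stage could work on the words and tags alone and still restore the original spacing when rendering.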

Either way, we'll sort something out.

And by the way, thanks for your interest!

markbirbeck, Mar 09 '15 18:03