Problems with tokenizing hyphenated words

Open dhowe opened this issue 4 years ago • 20 comments

oft-cited off-site deeply-nested

~should be handled as 3 tokens in tokenizer~ should be handled as a single token in tokenizer

  • [x] 1. ritajs tests
  • [x] 1. ritajs fix
  • [x] 2. sync tests with java
  • [x] 3. sync fix with java

dhowe avatar Jan 29 '21 06:01 dhowe

@KarlieZhao when you are working again, please remind me of the status of this...

dhowe avatar May 03 '22 09:05 dhowe

should be fixed already.

KarlieZhao avatar May 03 '22 18:05 KarlieZhao

Not quite - tokenization needs to be consistent. That is, the tokenizer and the analyzer need to stay in sync -- we don't want the tokenizer returning one set of tokens and the analyzer another, as is currently the case; or some calls to the tokenizer returning one value, and other calls a different value.

So the following must always be true:

expect(RiTa.analyze(word)["tokens"].split(' ')).eql(RiTa.tokenize(word));
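A minimal sketch of how this invariant could be checked over a list of words. `findInconsistent` is a hypothetical helper, not part of RiTa; it accepts any object exposing `analyze` and `tokenize` (RiTa itself in the real tests), so it can also be run against a stub:

```javascript
// Hypothetical helper (not part of RiTa): returns the words for which
// analyze() and tokenize() disagree about the tokens.
function findInconsistent(rita, words) {
  return words.filter((word) => {
    const fromAnalyze = rita.analyze(word)["tokens"].split(" ");
    const fromTokenize = rita.tokenize(word);
    // Compare element-wise; two distinct arrays are never === to each other
    return (
      fromAnalyze.length !== fromTokenize.length ||
      fromAnalyze.some((tok, i) => tok !== fromTokenize[i])
    );
  });
}
```

With a consistent tokenizer and analyzer, a call such as `findInconsistent(RiTa, ["off-site", "up-to-date", "state-of-the-art"])` would be expected to return an empty array.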

Also the current code doesn't handle words with multiple hyphens: e.g. 'over-the-counter', 'up-to-date' and 'state-of-the-art'.

So, how to solve... Ideally dashed words should be treated as single tokens in all cases. This is the correct solution. However it is not trivial (which is why I suggested the less-than-ideal solution of breaking them up):

Because dashed words are not included in the lexicon, we either need to add them (I'm not sure how many there are, but my guess is several thousand at least, and effectively unbounded once numbers are considered) OR compose the dashed word's features from its component parts, e.g., for 'deeply-nested', we separately look up the features for 'deeply' and 'nested' and then combine them. In either case, we also need to take into account hyphenated words where the letter-to-sound (LTS) engine must handle one or more parts (when a part of the word is NOT in the lexicon).

I suppose a simple way to solve this would be to leave the tokenizer as is, and to disable the 'keepHyphen' option in the analyzer, though this is, as mentioned, less than ideal...

Consistent sets of tests below (should also include multiple-hyphen-words):

  // Option 1: treating the word as a single token
  {
    let feats = RiTa.analyze("off-site");

    eq(feats["pos"], "nn");
    eq(feats["tokens"], "off-site");
    eq(feats["phones"], 'ao-f-s-ay-t');
    eq(feats["stresses"], "1-1");
    eq(feats["syllables"], "ao-f/s-ay-t");
    // arrays need a deep comparison, so use eql rather than eq
    expect(feats["tokens"].split(' ')).eql(RiTa.tokenize("off-site"));
  }

  // Option 2: treating the parts as separate tokens
  {
    let feats = RiTa.analyze("off-site");

    eq(feats["pos"], "jj nn");
    eq(feats["tokens"], "off site");
    eq(feats["phones"], 'ao-f s-ay-t');
    eq(feats["stresses"], "1 1");
    eq(feats["syllables"], "ao-f s-ay-t");
    expect(feats["tokens"].split(' ')).eql(RiTa.tokenize("off-site"));
  }

dhowe avatar May 04 '22 07:05 dhowe

> compose the dashed word features from their component parts

I think that is a possible way of doing it (a similar technique is implemented in the tagger, see https://github.com/dhowe/ritajs/blob/b5b447c300739928aabb7f6f83577493695a5512/src/tagger.js#L495 )

But before that is done, maybe just turn off keepHyphen for now

Real-John-Cheung avatar May 04 '22 07:05 Real-John-Cheung

but the question of words (specifically, parts of hyphenated words) that are not in the lexicon is still a problem...

pos case is a bit different as it doesn't require an additional lexicon lookup

dhowe avatar May 04 '22 07:05 dhowe

> But before that is done, maybe just turnoff the keepHyphen for now

Can one of you take care of this (@Real-John-Cheung, ramble is highest priority now)?

  • [x] add tests above, plus others, including words with multiple hyphens
  • [x] fix regex at src/tokenizer.js line 71
  • [x] disable keepHyphen flag
  • [x] verify all tests pass
  • [x] sync with java

dhowe avatar May 04 '22 08:05 dhowe

do you have bandwidth @KarlieZhao ?

dhowe avatar May 16 '22 08:05 dhowe

yes, I'll start working on this.

KarlieZhao avatar May 16 '22 19:05 KarlieZhao

I made an implementation for analysing a hyphenated word as a single token (PR above). Performance is acceptable: it does not trigger the exec-time warning in the larger test pool, and that group of tests takes ~40ms to run on its own. We can review and then sync with Java.

Real-John-Cheung avatar Jun 23 '22 09:06 Real-John-Cheung

@Real-John-Cheung can you summarize for me what you did here? I notice some of the tests are not correct, for example (for "state-of-the-art"):

 eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");

And how is "state-of-the-art" being recognized as a noun phrase, when the individual words have 4 separate parts of speech?

// it should be recognized as noun without context, as "state of the art" is a noun phrase - JC
eq(feats["pos"], "nn"); 

Also, how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't?

dhowe avatar Jun 23 '22 10:06 dhowe

> can you summarize me what you did here?

So basically this algorithm treats a hyphenated word as a sort of "phrase": it breaks the word into parts, treats each part as a single word and looks up its raw phones in the dictionary, then combines the raw phones to produce the phones, syllables, and stresses.

> how is "state-of-the-art" being recognized as a noun-phrase, when the individual words have 4 separate parts-of-speech

The tagger part is a bit more complicated. There is a set of rules that decide the POS of a hyphenated word from the POS of each part. In the case of "state-of-the-art", the algorithm recognizes it as a noun because the first part ("state") can be a noun and the remaining parts can be seen as modifying that noun.

> how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't

Generally, for parts that are not in the lexicon, LTS is used for the raw phones and the POS defaults to 'nn'. I would say there are not many cases where parts of a hyphenated word are missing from the lexicon.
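The composition step described above might look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: `composeHyphenated`, `lookup` and `lts` are hypothetical names, each part is assumed to yield a syllable string such as `"dh-ah"`, and the placeholder fallback is not a real letter-to-sound engine. Stresses and POS rules are left out.

```javascript
// Sketch only: `lookup` and `lts` are hypothetical stand-ins for a lexicon
// lookup and a letter-to-sound fallback; neither is RiTa's actual API.
function composeHyphenated(word, lookup, lts) {
  const syllables = word
    .split("-")
    .filter((part) => part.length > 0)
    // Use the lexicon entry when there is one, otherwise fall back to LTS
    .map((part) => lookup(part) ?? lts(part))
    .join("/"); // "/" separates syllables, "-" separates phones
  return {
    tokens: word, // the hyphenated word stays a single token
    syllables,
    phones: syllables.replace(/\//g, "-"),
  };
}

// Toy lexicon for illustration; the phonemes are those quoted in this thread
const toyLexicon = { state: "s-t-ey-t", of: "ah-v", the: "dh-ah", art: "aa-r-t" };
const feats = composeHyphenated(
  "state-of-the-art",
  (part) => toyLexicon[part],
  (part) => part.split("").join("-") // naive placeholder, not a real LTS
);
```

Joining the per-part results with `/` keeps the syllable boundaries, so a multi-syllable part whose entry already contains `/` passes through unchanged.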

Real-John-Cheung avatar Jun 23 '22 12:06 Real-John-Cheung

ok, so we need tests for hyphenated words where:

  • both parts are in the lexicon
  • neither part is in the lexicon
  • one part is, the other isn't

also, we need to fix tests like this (should be 4 syllables): eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");

as well as cases where one of the parts is multi-syllabic

also cases with other POS, e.g.,

  • The not-for-profit company (jj)
  • Let's follow-up tomorrow (vb)
  • etc.

dhowe avatar Jun 24 '22 11:06 dhowe

some more words here: https://gist.github.com/dhowe/b384269c1ef88c32482a695403b772dd

dhowe avatar Jun 24 '22 11:06 dhowe

> also, we need to fix tests like this (should be 4 syllables): eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");

just to confirm, the correct output should be s-t-ey-t/ah-v/dh-ah/aa-r-t ?

Real-John-Cheung avatar Jun 27 '22 02:06 Real-John-Cheung

I believe so, assuming those are the correct phonemes for the individual words

as this is a tricky issue with many parts, please leave a marker in places where you've added or removed code throughout, perhaps "HWF:" for 'hyphenated word fix'

dhowe avatar Jun 27 '22 04:06 dhowe

There are now 75 tests in 4 pools:

  • pool1: all parts in lexicon
  • pool2A: some parts not in lexicon, but they are variants of words in the lexicon
  • pool2B: some parts not in lexicon, and the missing parts are not related to any word in the lexicon
  • pool3: no parts in lexicon

Most of the hyphenated words belong to pool1 and pool2A; only a few are in the other 2 pools.
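As a rough illustration of how a word could be assigned to one of these pools, here is a small sketch. `poolOf` is a hypothetical helper, and `inLexicon` and `isVariant` are caller-supplied predicates assumed for the example, not RiTa functions:

```javascript
// Hypothetical sketch: `inLexicon` and `isVariant` are caller-supplied
// predicates, not RiTa functions.
function poolOf(word, inLexicon, isVariant) {
  const parts = word.split("-");
  const missing = parts.filter((part) => !inLexicon(part));
  if (missing.length === 0) return "pool1"; // all parts in lexicon
  if (missing.length === parts.length) return "pool3"; // no part in lexicon
  // Some parts missing: 2A if every missing part is a variant of a known word
  return missing.every(isVariant) ? "pool2A" : "pool2B";
}
```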

Tests for tagging hyphenated words in sentences are still to be added.

Real-John-Cheung avatar Jun 30 '22 10:06 Real-John-Cheung

@Real-John-Cheung I've merged this with some of my own refactors -- can you sync with java?

dhowe avatar Jul 06 '22 02:07 dhowe

sure, I will finish the tests for hyphenated words in sentences for tagger first. then sync

Real-John-Cheung avatar Jul 06 '22 08:07 Real-John-Cheung

@Real-John-Cheung @KarlieZhao seems this fix for hyphens has broken RiTa on Safari (especially problematic for iOS), as the regex here uses lookbehinds which Safari does not support

See the ticket here: https://github.com/dhowe/rita/issues/189

Can one of you look into a fix? See this SO ticket as a place to start...

dhowe avatar Oct 13 '22 07:10 dhowe

replaced it with .replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2"); should work on all browsers now (maybe not IE...)
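One thing that may be worth checking with this replacement (how its result is consumed in the tokenizer is not shown in this thread, so this is only a sketch): because regex matches cannot overlap, the two-group pattern skips every second hyphen in words with three or more parts. A lookahead, which unlike a lookbehind is supported in older Safari, avoids consuming the following part. `splitHyphens` is a hypothetical name used for illustration:

```javascript
const word = "state-of-the-art";

// Pattern from the comment above: it consumes the letters on both sides of
// each hyphen, so the next match cannot start at "of"
const twoGroups = word.replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2");

// Alternative sketch: a lookahead checks the following letter without
// consuming it, so every hyphen between letters is matched
function splitHyphens(text) {
  return text.replace(/([a-zA-Z]+)-(?=[a-zA-Z])/g, "$1 - ");
}

console.log(twoGroups);          // state - of-the - art
console.log(splitHyphens(word)); // state - of - the - art
```

For two-part words such as "off-site" both versions give the same result, "off - site".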

Real-John-Cheung avatar Oct 13 '22 14:10 Real-John-Cheung