Problems with tokenizing hyphenated words
oft-cited, off-site, deeply-nested
~~should be handled as 3 tokens in the tokenizer~~ should be handled as a single token in the tokenizer
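A minimal sketch of the intended behavior (the token output shown is an expectation based on the examples above, not verified against the current code):

// desired behavior: each hyphenated word comes back as a single token
RiTa.tokenize("a deeply-nested object"); // expected -> ["a", "deeply-nested", "object"]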
- [x] 1. ritajs tests
- [x] 2. ritajs fix
- [x] 3. sync tests with java
- [x] 4. sync fix with java
@KarlieZhao when you are working again, please remind me of the status of this...
should be fixed already.
Not quite - tokenization needs to be consistent. That is, the tokenizer and the analyzer need to stay in sync -- we don't want the tokenizer returning one set of tokens and the analyzer another, as is currently the case; or some calls to the tokenizer returning one value, and other calls a different value.
So the following must always be true:
expect(RiTa.analyze(word)["tokens"].split(' ')).eq(RiTa.tokenize(word));
Also the current code doesn't handle words with multiple hyphens: e.g. 'over-the-counter', 'up-to-date' and 'state-of-the-art'.
So, how to solve... Ideally dashed words should be treated as single tokens in all cases. This is the correct solution. However it is not trivial (which is why I suggested the less-than-ideal solution of breaking them up):
Because dashed words are not included in the lexicon, we either need to add them (I'm not sure how many there are, but my guess is several thousand at least, perhaps infinite when considering numbers) OR compose the dashed word's features from its component parts; e.g., for 'deeply-nested', we separately look up the features for 'deeply' and 'nested' and then combine them. In either case, we also need to account for hyphenated words where the letter-to-sound engine must handle one or more parts (when a part of the word is NOT in the lexicon).
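As a rough sketch of that composition idea (not the library's implementation; analyzeHyphenated is a hypothetical helper, and the joiners are inferred from the tests below):

// compose features for a hyphenated word from its parts (hypothetical helper)
function analyzeHyphenated(word) {
  let parts = word.split('-').map(p => RiTa.analyze(p));
  return {
    tokens: word,                                      // keep the whole word as one token
    phones: parts.map(p => p.phones).join('-'),        // e.g. 'ao-f-s-ay-t' for 'off-site'
    stresses: parts.map(p => p.stresses).join('-'),    // e.g. '1-1'
    syllables: parts.map(p => p.syllables).join('/'),  // e.g. 'ao-f/s-ay-t'
    pos: parts.map(p => p.pos).join(' ')               // POS combination needs its own rules (see tagger discussion below)
  };
}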
I suppose a simple way to solve this would be to leave the tokenizer as is, and to disable the 'keepHyphen' option in the analyzer, though this is, as mentioned, less than ideal...
Consistent sets of tests below (should also include multi-hyphen words):
// treating word as a single token:
let feats = RiTa.analyze("off-site");
eq(feats["pos"], "nn");
eq(feats["tokens"], "off-site");
eq(feats["phones"], 'ao-f-s-ay-t');
eq(feats["stresses"], "1-1");
eq(feats["syllables"], "ao-f/s-ay-t");
eq(feats["tokens"].split(' '), RiTa.tokenize("off-site"));
// treating parts as separate tokens:
let feats = RiTa.analyze("off-site");
eq(feats["pos"], "jj nn");
eq(feats["tokens"], "off site");
eq(feats["phones"], 'ao-f s-ay-t');
eq(feats["stresses"], "1 1");
eq(feats["syllables"], "ao-f s-ay-t");
eq(feats["tokens"].split(' '), RiTa.tokenize("off-site"));
compose the dashed word features from their component parts
I think that could be a good way to do it (a similar technique is implemented in the tagger, see https://github.com/dhowe/ritajs/blob/b5b447c300739928aabb7f6f83577493695a5512/src/tagger.js#L495)
But before that is done, maybe just turn off the keepHyphen flag for now
but the question of words (specifically, parts of hyphenated words) that are not in the lexicon is still a problem...
The POS case is a bit different, as it doesn't require an additional lexicon lookup
But before that is done, maybe just turn off the keepHyphen flag for now
Can one of you take care of this (@Real-John-Cheung, ramble is highest priority now) ?
- [x] add tests above, plus others, including w multiple hyphens
- [x] fix regex at src/tokenizer.js line 71
- [x] disable keepHyphen flag
- [x] verify all tests pass
- [x] sync with java
do you have bandwidth @KarlieZhao ?
yes, I'll start working on this.
I made an implementation for analysing hyphenated words as single tokens (above PR). The performance is not bad (no exec-time warning in the larger test pool; it takes ~40ms to execute that group of tests independently). We can review, then sync with Java
@Real-John-Cheung can you summarize me what you did here? I notice some of the tests are not correct, for example (for "state-of-the-art"):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
And how is "state-of-the-art" being recognized as a noun phrase, when the individual words have 4 separate parts-of-speech?
// it should be recognized as a noun without context, as "state of the art" is a noun phrase - JC
eq(feats["pos"], "nn");
Also, how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't?
can you summarize what you did here?
So basically this algorithm treats a hyphenated word as a sort of "phrase": it breaks the word into parts, treats each part as a single word, looks up its raw phones in the dictionary, and then combines the raw phones to produce the phones, syllables, and stresses.
how is "state-of-the-art" being recognized as a noun-phrase, when the individual words have 4 separate parts-of -speech
The tagger part is a bit more complicated. There is a set of rules to decide the POS of a hyphenated word according to the POS of each part. In the case of "state-of-the-art", the algorithm recognizes it as a noun because the first part ("state") can be a noun and the remaining parts can be seen as modifying/describing that noun.
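A much-simplified sketch of the noun rule described above (posForHyphenated is hypothetical; the real tagger has more rules than this):

// if the first part can be a noun, tag the whole hyphenated word as 'nn'
function posForHyphenated(word) {
  let parts = word.split('-');
  if (RiTa.isNoun(parts[0])) return 'nn';  // 'state' can be a noun -> 'state-of-the-art' is 'nn'
  // ... further rules for other part-of-speech combinations would go here
  return 'nn';                             // parts not in the lexicon default to 'nn' (see below)
}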
how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't
Generally, for parts that are not in the lexicon, we use LTS for the raw phones and the POS defaults to 'nn'. I would say there are not many cases where parts of a hyphenated word are not in the lexicon.
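A small sketch of that fallback (featuresForPart is a hypothetical helper; RiTa.analyze already uses LTS for words it can't find):

// derive per-part features; unknown parts get LTS phones and a default 'nn' tag
function featuresForPart(part) {
  let feats = RiTa.analyze(part);                      // falls back to LTS phones for unknown words
  let pos = RiTa.hasWord(part) ? feats["pos"] : 'nn';  // out-of-lexicon parts default to 'nn'
  return { phones: feats["phones"], pos: pos };
}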
ok, so we need tests for hyphenated words where (see the sketch after this list):
- both parts are in the lexicon
- neither part is in the lexicon
- one part is, the other isn't
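A sketch of how the three cases might be exercised (the out-of-lexicon words are placeholders, and eql is used for the array comparison):

// whatever the lexicon coverage, analyze() and tokenize() must stay in sync
for (let w of ["off-site", "off-xxxx", "yyyy-xxxx"]) {  // placeholder non-words for the last two cases
  expect(RiTa.analyze(w)["tokens"].split(' ')).eql(RiTa.tokenize(w));
}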
also, we need to fix tests like this (should be 4 syllables):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
as well as cases where one of the parts is multi-syllabic
also cases with other POS, e.g.,
- The not-for-profit company (jj)
- Let's follow-up tomorrow (vb)
- etc.
some more words here: https://gist.github.com/dhowe/b384269c1ef88c32482a695403b772dd
also, we need to fix tests like this (should be 4 syllables):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
Just to confirm, the correct output should be s-t-ey-t/ah-v/dh-ah/aa-r-t?
I believe so, assuming those are the correct phonemes for the individual words
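For reference, the corrected assertion would then read (assuming those per-part phonemes are right):

let feats = RiTa.analyze("state-of-the-art");
eq(feats["syllables"], "s-t-ey-t/ah-v/dh-ah/aa-r-t"); // 4 syllables, one per part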
as this is a tricky issue with many parts, please leave a marker in places where you've added or removed code throughout, perhaps "HWF:" for 'hyphenated word fix'
There are now 75 tests in 4 pools:
- pool1: all parts in lexicon
- pool2A: some parts not in lexicon but are variants of words in lexicon
- pool2B: some parts not in lexicon and the missing parts are not related to any word in lexicon
- pool3: all parts not in lexicon
Most hyphenated words belong to pool1 and pool2A; only a few are in the other two pools.
Tests for tagging hyphenated words in sentences are still to be added
@Real-John-Cheung I've merged this with some of my own refactors -- can you sync with java?
Sure, I will finish the tagger tests for hyphenated words in sentences first, then sync
@Real-John-Cheung @KarlieZhao it seems this fix for hyphens has broken RiTa on Safari (especially problematic for iOS), as the regex here uses lookbehinds, which Safari does not support
See the ticket here: https://github.com/dhowe/rita/issues/189
Can one of you look into a fix? See this SO ticket as a place to start...
replaced it with .replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2");
should work on all browsers now (maybe not IE...)
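A minimal illustration of the compatibility point (the lookbehind pattern shown is only an example, not the actual tokenizer regex; the result shows the replacement mechanics only, not the tokenizer's final output):

// lookbehind patterns such as /(?<=[a-zA-Z])-/ fail to parse on engines without lookbehind
// support (Safari/iOS at the time); the capture-group form avoids lookbehind entirely
"off-site".replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2"); // -> "off - site"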