Problems with tokenizing hyphenated words
oft-cited, off-site, deeply-nested
~~should be handled as 3 tokens in the tokenizer~~ should be handled as a single token in the tokenizer
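A minimal sketch of the intended behavior (the token output shown is an expectation based on the examples above, not verified against the current code):

// desired behavior: each hyphenated word comes back as a single token
RiTa.tokenize("a deeply-nested object"); // expected -> ["a", "deeply-nested", "object"]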
- [x] 1. ritajs tests
- [x] 2. ritajs fix
- [x] 3. sync tests with java
- [x] 4. sync fix with java
@KarlieZhao when you are working again, please remind me of the status of this...
should be fixed already.
Not quite - tokenization needs to be consistent. That is, the tokenizer and the analyzer need to stay in sync -- we don't want the tokenizer returning one set of tokens and the analyzer another, as is currently the case; or some calls to the tokenizer returning one value, and other calls a different value.
So the following must always be true:
expect(RiTa.analyze(word)["tokens"].split(' ')).eq(RiTa.tokenize(word));
Also the current code doesn't handle words with multiple hyphens: e.g. 'over-the-counter', 'up-to-date' and 'state-of-the-art'.
So, how to solve... Ideally dashed words should be treated as single tokens in all cases. This is the correct solution. However it is not trivial (which is why I suggested the less-than-ideal solution of breaking them up):
Because dashed words are not included in the lexicon, we either need to add them (I'm not sure how many there are, but my guess is several thousand at least, perhaps infinite when considering numbers) OR compose the dashed word's features from its component parts; e.g., for 'deeply-nested', we separately look up the features for 'deeply' and 'nested' and then combine them. In either case, we also need to account for hyphenated words where the letter-to-sound engine must handle one or more parts (when a part of the word is NOT in the lexicon).
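As a rough sketch of that composition idea (not the library's implementation; analyzeHyphenated is a hypothetical helper, and the joiners are inferred from the tests below):

// compose features for a hyphenated word from its parts (hypothetical helper)
function analyzeHyphenated(word) {
  let parts = word.split('-').map(p => RiTa.analyze(p));
  return {
    tokens: word,                                      // keep the whole word as one token
    phones: parts.map(p => p.phones).join('-'),        // e.g. 'ao-f-s-ay-t' for 'off-site'
    stresses: parts.map(p => p.stresses).join('-'),    // e.g. '1-1'
    syllables: parts.map(p => p.syllables).join('/'),  // e.g. 'ao-f/s-ay-t'
    pos: parts.map(p => p.pos).join(' ')               // POS combination needs its own rules (see tagger discussion below)
  };
}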
I suppose a simple way to solve this would be to leave the tokenizer as is, and to disable the 'keepHyphen' option in the analyzer, though this is, as mentioned, less than ideal...
Consistent sets of tests below (should also include multi-hyphen words):
// treating word as a single token:
let feats = RiTa.analyze("off-site");
eq(feats["pos"], "nn");
eq(feats["tokens"], "off-site");
eq(feats["phones"], 'ao-f-s-ay-t');
eq(feats["stresses"], "1-1");
eq(feats["syllables"], "ao-f/s-ay-t");
eq(feats["tokens"].split(' '), RiTa.tokenize("off-site"));
// treating parts as separate tokens:
let feats = RiTa.analyze("off-site");
eq(feats["pos"], "jj nn");
eq(feats["tokens"], "off site");
eq(feats["phones"], 'ao-f s-ay-t');
eq(feats["stresses"], "1 1");
eq(feats["syllables"], "ao-f s-ay-t");
eq(feats["tokens"].split(' '), RiTa.tokenize("off-site"));
compose the dashed word features from their component parts
I think that could be a good way to do it (a similar technique is implemented in the tagger, see https://github.com/dhowe/ritajs/blob/b5b447c300739928aabb7f6f83577493695a5512/src/tagger.js#L495)
But before that is done, maybe just turn off the keepHyphen flag for now
but the question of words (specifically, parts of hyphenated words) that are not in the lexicon is still a problem...
The POS case is a bit different, as it doesn't require an additional lexicon lookup
But before that is done, maybe just turn off the keepHyphen flag for now
Can one of you take care of this (@Real-John-Cheung, ramble is highest priority now) ?
- [x] add tests above, plus others, including w multiple hyphens
- [x] fix regex at src/tokenizer.js line 71
- [x] disable keepHyphen flag
- [x] verify all tests pass
- [x] sync with java
do you have bandwidth @KarlieZhao ?
yes, I'll start working on this.
I made an implementation for analysing hyphenated words as single tokens (above PR). The performance is not bad (no exec-time warning in the larger test pool; it takes ~40ms to execute that group of tests independently). We can review, then sync with Java
@Real-John-Cheung can you summarize me what you did here? I notice some of the tests are not correct, for example (for "state-of-the-art"):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
And how is "state-of-the-art" being recognized as a noun phrase, when the individual words have 4 separate parts-of-speech?
// it should be recognized as a noun without context, as "state of the art" is a noun phrase - JC
eq(feats["pos"], "nn");
Also, how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't?
can you summarize what you did here?
So basically this algorithm treats a hyphenated word as a sort of "phrase": it breaks the word into parts, treats each part as a single word, looks up its raw phones in the dictionary, and then combines the raw phones to produce the phones, syllables, and stresses.
how is "state-of-the-art" being recognized as a noun-phrase, when the individual words have 4 separate parts-of -speech
The tagger part is a bit more complicated. There is a set of rules to decide the POS of a hyphenated word according to the POS of each part. In the case of "state-of-the-art", the algorithm recognizes it as a noun because the first part ("state") can be a noun and the remaining parts can be seen as modifying/describing that noun.
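A much-simplified sketch of the noun rule described above (posForHyphenated is hypothetical; the real tagger has more rules than this):

// if the first part can be a noun, tag the whole hyphenated word as 'nn'
function posForHyphenated(word) {
  let parts = word.split('-');
  if (RiTa.isNoun(parts[0])) return 'nn';  // 'state' can be a noun -> 'state-of-the-art' is 'nn'
  // ... further rules for other part-of-speech combinations would go here
  return 'nn';                             // parts not in the lexicon default to 'nn' (see below)
}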
how does this handle words not in the lexicon, or hyphenated words where one part is in the lexicon and the other isn't
Generally, for parts that are not in the lexicon, we use LTS for the raw phones and the POS defaults to 'nn'. I would say there are not many cases where parts of a hyphenated word are not in the lexicon.
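A small sketch of that fallback (featuresForPart is a hypothetical helper; RiTa.analyze already uses LTS for words it can't find):

// derive per-part features; unknown parts get LTS phones and a default 'nn' tag
function featuresForPart(part) {
  let feats = RiTa.analyze(part);                      // falls back to LTS phones for unknown words
  let pos = RiTa.hasWord(part) ? feats["pos"] : 'nn';  // out-of-lexicon parts default to 'nn'
  return { phones: feats["phones"], pos: pos };
}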
ok, so we need tests for hyphenated words where (see the sketch after this list):
- both parts are in the lexicon
- neither part is in the lexicon
- one part is, the other isn't
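A sketch of how the three cases might be exercised (the out-of-lexicon words are placeholders, and eql is used for the array comparison):

// whatever the lexicon coverage, analyze() and tokenize() must stay in sync
for (let w of ["off-site", "off-xxxx", "yyyy-xxxx"]) {  // placeholder non-words for the last two cases
  expect(RiTa.analyze(w)["tokens"].split(' ')).eql(RiTa.tokenize(w));
}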
also, we need to fix tests like this (should be 4 syllables):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
as well as cases where one of the parts is multi-syllabic
also cases with other POS, e.g.,
- The not-for-profit company (jj)
- Let's follow-up tomorrow (vb)
- etc.
some more words here: https://gist.github.com/dhowe/b384269c1ef88c32482a695403b772dd
also, we need to fix tests like this (should be 4 syllables):
eq(feats["syllables"], "s-t-ey-t-ah-v-dh-ah-aa-r-t");
Just to confirm, the correct output should be s-t-ey-t/ah-v/dh-ah/aa-r-t?
I believe so, assuming those are the correct phonemes for the individual words
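For reference, the corrected assertion would then read (assuming those per-part phonemes are right):

let feats = RiTa.analyze("state-of-the-art");
eq(feats["syllables"], "s-t-ey-t/ah-v/dh-ah/aa-r-t"); // 4 syllables, one per part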
as this is a tricky issue with many parts, please leave a marker in places where you've added or removed code throughout, perhaps "HWF:" for 'hyphenated word fix'
There are now 75 tests in 4 pools:
- pool1: all parts in lexicon
- pool2A: some parts not in lexicon but are variants of words in lexicon
- pool2B: some parts not in lexicon and the missing parts are not related to any word in lexicon
- pool3: all parts not in lexicon
Most hyphenated words belong to pool1 and pool2A; only a few are in the other two pools.
Tests for tagging hyphenated words in sentences are still to be added
@Real-John-Cheung I've merged this with some of my own refactors -- can you sync with java?
Sure, I will finish the tagger tests for hyphenated words in sentences first, then sync
@Real-John-Cheung @KarlieZhao it seems this fix for hyphens has broken RiTa on Safari (especially problematic for iOS), as the regex here uses lookbehinds, which Safari does not support
See the ticket here: https://github.com/dhowe/rita/issues/189
Can one of you look into a fix? See this SO ticket as a place to start...
replaced it with .replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2");
should work on all browsers now (maybe not IE...)
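A minimal illustration of the compatibility point (the lookbehind pattern shown is only an example, not the actual tokenizer regex; the result shows the replacement mechanics only, not the tokenizer's final output):

// lookbehind patterns such as /(?<=[a-zA-Z])-/ fail to parse on engines without lookbehind
// support (Safari/iOS at the time); the capture-group form avoids lookbehind entirely
"off-site".replace(/([a-zA-Z]+)-([a-zA-Z]+)/g, "$1 - $2"); // -> "off - site"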