
Covering ambiguity improvements

Open lukhnos opened this issue 2 years ago • 1 comments

Thanks for the insightful contribution—I could run the analysis without any problem and didn't realize the prevalence of, for example, conflicting/shadowing 3-syllable scripts such as (電子式) vs (電子)(是) and (微電腦) vs (為)(電腦).

Glad to hear that it works for you. BTW, technically those are "covering" ambiguities. Conflicting ambiguities are more like (成分)(子) vs. (成)(分子). Not really important for the discussion here, though.
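To make the distinction concrete, here is a minimal sketch that tells the two kinds of ambiguity apart over a toy lexicon. The function names and the lexicon are hypothetical, and the keys are plain character strings for readability; in an input method the comparison would happen over reading (syllable) sequences, which is why homophones such as 式 and 是 compete for the same slot.

```python
def is_covering(text, lexicon):
    """True if `text` is an entry and also splits into two shorter entries."""
    return text in lexicon and any(
        text[:i] in lexicon and text[i:] in lexicon
        for i in range(1, len(text))
    )


def is_conflicting(text, lexicon):
    """True if `text` splits at two different points into pairs of entries,
    e.g. (AB)(C) vs. (A)(BC), so the two segmentations overlap on B."""
    cuts = [
        i for i in range(1, len(text))
        if text[:i] in lexicon and text[i:] in lexicon
    ]
    return len(cuts) >= 2


# Toy lexicon, for illustration only.
lexicon = {"電子式", "電子", "式", "成分", "子", "成", "分子"}

print(is_covering("電子式", lexicon))
print(is_conflicting("成分子", lexicon))
```

The covering check needs the whole string to be an entry; the conflicting check does not, since the competition there is between two splits of the same string.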

Is there anything you think we can pursue? The improvements in #329 should mitigate the issues with 電子式 and 微電腦 once the user has chosen the candidate, but it's still odd that phrases like 工作證 are not chosen in the first place. Is there anything we can do for those?

Yes, there are some options, IMO. They are not mutually exclusive; choosing among them is probably only a matter of difficulty and urgency.

  • [ ] 1. Increase logprobs artificially for desired outcomes
    • [ ] a. By the naked eye
    • [ ] b. By an algorithm (the problem itself can be NP-Complete)
  • [ ] 2. Implement Katz's back-off model
  • [ ] 3. Recalculate logprobs with more data
  • [ ] 4. Recalculate logprobs with an unsupervised segmentation algorithm
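As a rough illustration of option (1), the sketch below picks the segmentation with the highest sum of unigram logprobs and shows how raising one entry's logprob flips the result. It assumes a simple unigram model; the `best_path` helper, the romanized reading keys, and every logprob value are made up for the example and are not taken from the McBopomofo data or engine.

```python
def best_path(syllables, unigrams):
    """Return (score, phrases) for the highest-scoring segmentation.

    `unigrams` maps a tuple of syllables to a list of (phrase, logprob).
    """
    n = len(syllables)
    # best[k] holds the best (score, phrases) covering the first k syllables.
    best = [(0.0, [])] + [(float("-inf"), [])] * n
    for end in range(1, n + 1):
        for start in range(end):
            key = tuple(syllables[start:end])
            for phrase, logprob in unigrams.get(key, []):
                score = best[start][0] + logprob
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [phrase])
    return best[n]


reading = ("gong", "zuo", "zheng")  # placeholder keys, not real data
unigrams = {
    ("gong", "zuo"): [("工作", -3.0)],
    ("zheng",): [("政", -3.5), ("證", -4.5)],
    ("gong", "zuo", "zheng"): [("工作證", -7.0)],
}
before = best_path(reading, unigrams)

# Option (1): raise the logprob of the desired 3-syllable entry by hand.
boosted = dict(unigrams)
boosted[("gong", "zuo", "zheng")] = [("工作證", -6.0)]
after = best_path(reading, boosted)

print(before[1], after[1])
```

With these made-up numbers the two-part path wins until the longer entry is boosted, which is the situation described for 工作證 above.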

I will elaborate on (1.b), (3), and (4) at the bottom of this comment. Also, some of the above, especially (4), are related to https://github.com/openvanilla/McBopomofo/blob/master/Source/Data/bin/buildFreq.py#L51

Meanwhile, it would be nice to enhance https://github.com/openvanilla/McBopomofo/blob/master/Source/Data/bin/self-score-test.py with the approach of this PR and #329. However, since GitHub Actions limits free running hours, the implementation in this PR could use a faster algorithm for CI/CD, which is why I marked this PR as a draft. Or even better, test the engine directly.

For #300, it is also recommended to have user-defined scores included in the test. And then one may see whether it is necessary to change the scoring function or even the entire online learning algorithm.


Elaborations

  • (1.b) Let's say we prefer 工作證 over 工作-政 and change some of their logprobs. A possible side effect is that 作證 can sometimes be demoted. (Even worse, 員工 of 員工-作證 can be disrupted, too.) One can formulate this situation as a rank aggregation problem, as studied in math and economics, where a "rational" ranking contains no cycle. For example, when asked for the preference among coffee, tea, and juice, a rational ranking should not form this cycle: coffee > tea, tea > juice, but juice > coffee. Unfortunately, once such conflicts exist, finding the ranking that violates the fewest preferences is NP-hard in general. That being said, approximate algorithms are usually sufficiently fast and good.
  • (3) Thanks to Common Crawl, there are many more open datasets based on it now. For example, https://oscar-corpus.com/post/oscar-v22-01/.
  • (4) One can run some unsupervised segmentation algorithm that takes https://github.com/openvanilla/McBopomofo/blob/master/Source/Data/bin/buildFreq.py#L51 into account. Conceptual steps are:
    • The unsupervised algorithm finds frequent strings (up to a certain length) without word segmentation (supervision);
    • Discount substrings. For example, one may find "Rail Enquiries" in a (small) corpus where they are all covered by "National Rail Enquiries" or "British Rail Enquiries," such that the count of "Rail Enquiries" should be discounted to zero;
    • Apply the new counts to lexicon entries that match exactly, and discard the rest for now, although this will skew the distribution.
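The conceptual steps for (4) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `discounted_counts` helper, the `min_count` threshold, and the toy corpus are hypothetical, tokens are words so that the "Rail Enquiries" example can be reused (for Chinese text the tokens would be characters), and it is not the logic in buildFreq.py.

```python
from collections import Counter


def discounted_counts(sentences, max_len, min_count):
    """Count frequent n-grams, ignoring occurrences that lie inside a
    longer frequent n-gram."""
    # Step 1: raw counts of every n-gram up to max_len, no segmentation.
    raw = Counter()
    for sent in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sent) - n + 1):
                raw[tuple(sent[i:i + n])] += 1
    frequent = {gram for gram, count in raw.items() if count >= min_count}

    # Step 2: keep only occurrences not covered by a longer frequent n-gram.
    kept = Counter({gram: 0 for gram in frequent})
    for sent in sentences:
        spans = [
            (i, i + n)
            for n in range(1, max_len + 1)
            for i in range(len(sent) - n + 1)
            if tuple(sent[i:i + n]) in frequent
        ]
        for i, j in spans:
            covered = any(
                a <= i and j <= b and (b - a) > (j - i) for a, b in spans
            )
            if not covered:
                kept[tuple(sent[i:j])] += 1
    return kept


corpus = (
    [["National", "Rail", "Enquiries"]] * 2
    + [["British", "Rail", "Enquiries"]] * 2
    + [["Rail"]] * 2
)
counts = discounted_counts(corpus, max_len=3, min_count=2)

# Step 3: apply the new counts only to entries already in the lexicon.
lexicon = ["National Rail Enquiries", "Rail Enquiries", "Rail"]
new_counts = {entry: counts.get(tuple(entry.split()), 0) for entry in lexicon}
print(new_counts)
```

In this toy corpus every occurrence of "Rail Enquiries" sits inside a longer frequent string, so its count drops to zero, while "Rail" keeps only its standalone occurrences.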

(4) can help a lot when using (3), since the data in (3) usually isn't segmented.
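Returning to (1.b), the coffee/tea/juice example can be checked with a small brute-force sketch: it tries every ranking and reports the one that violates the fewest pairwise preferences. The `fewest_violations` helper is hypothetical and only practical for a handful of items, since the number of rankings grows factorially; a real tool would need one of the approximate algorithms mentioned above.

```python
from itertools import permutations


def fewest_violations(items, prefers):
    """Brute-force the ranking that violates the fewest preferences.

    `prefers` is a list of (a, b) pairs meaning "a should rank above b".
    """
    best_order, best_cost = None, None
    for order in permutations(items):
        position = {item: rank for rank, item in enumerate(order)}
        cost = sum(1 for a, b in prefers if position[a] > position[b])
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


items = ["coffee", "tea", "juice"]
cyclic = [("coffee", "tea"), ("tea", "juice"), ("juice", "coffee")]
consistent = [("coffee", "tea"), ("tea", "juice"), ("coffee", "juice")]

print(fewest_violations(items, cyclic))
print(fewest_violations(items, consistent))
```

For the cyclic preferences no ranking can satisfy everything, so at least one preference is always violated; the same thing can happen when hand-tuned logprobs for 工作證, 作證, and 員工 pull against each other.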

Originally posted by @tianjianjiang in https://github.com/openvanilla/McBopomofo/issues/330#issuecomment-1186871718

lukhnos avatar Jul 29 '22 06:07 lukhnos

Gosh it's been almost a year... I am going to refresh my recollection on this and see what I can do.

tianjianjiang avatar Jun 17 '23 07:06 tianjianjiang