keyman icon indicating copy to clipboard operation
keyman copied to clipboard

feat(web): dictionary-based wordbreaking - draft form with demo page 📚

Open jahorton opened this issue 1 year ago • 10 comments

This PR currently represents an effort to resolve #5025. It's a significant part of the way there, but there's notably more to do before we can call the effort complete. I actually have a lot of intermediate branches "ready to drop" if and when we want to do a proper review, etc of these changes.

Some of these have now been spun off into their own PRs:

  • https://github.com/keymanapp/keyman/pull/11867
  • https://github.com/keymanapp/keyman/pull/11868
  • https://github.com/keymanapp/keyman/pull/11869
  • https://github.com/keymanapp/keyman/pull/11870

That said, I still have four more distinct branches comprising this one PR.

  • One for the actual, initial implementation
  • One for establishing a unit test; we need the ability to draw in and read from a compiled Trie to use as wordbreaker's dictionary
  • One to handle parts of the context not found in the dictionary
  • And finally, one to build the demo page seen in the examples below.

Known tasks that would still need to be done:

  • The new wordbreaker cannot yet be specified for use by any Developer-compiled lexical model

    • This is not straight-forward - our existing Trie model pattern requires that the wordbreaker be built first... but this wordbreaker requires a pre-built Trie to be provided to its constructor.
    • Current thought: we could consider building the core Trie first, then providing it separately to both the TrieModel and the wordbreaker.
  • If a typo occurs while typing a long word that has a prefix which is a valid shorter word, sometimes the wordbreaker will auto-break at the end of that prefix, meaning that we can no longer correct as expected.

    • Suppose when typing basketball, we typed basker. bask is a perfectly valid English word. basker is technically legal English but is likely quite rare.
    • If basker is not available, we then end up with bask | er, thus with corrections and predictions based on er rather than basker.
    • So... how to mitigate and/or work around this issue?

This specific branch is devoted to the demo / testing-host page that supports a state in which the wordbreaker is usable outside of the worker, able to report the state of wordbreaking as context is edited. I aimed to create something like the Developer JS-keyboard test-host has for character map viewing. To do so means building some extra JS bundles in order to support it, and I'm not sure if these specific changes (for the specialized page) should ever fully land.

Anyway, about the page:

image

Test page link - currently bottom of the testing-index's second section, labeled "Demo / test-harness for dictionary-based wordbreaking". (Link matches commit https://github.com/keymanapp/keyman/pull/10973/commits/9b580aa161b0e0af33b566182899d23f1537353b.)

The test page allows retrieving any Keyman keyboard from the cloud while retrieving the lexical model file locally via file-selector. The lexical model will then be 'lightly hacked' - its backing data will be retrieved (via known private field) and fed to the prototype dict wordbreaker, regardless of whatever the model's actual wordbreaking setting is. (Note: no model is pre-loaded for the page.) Predictive-text itself is not activated, though.

This page should help make assessment easier, allowing us to experiment with how it operates with the data from any existing lexical model. I've already tested it out with our English (MTNT) model quite a bit and found the results fairly enlightening; they've already led to a bug fix or two.

A few sample runs, using English:

dict-breaker sample 1

But how well does it handle stuff not in the lexical-model?

dict-breaker sample 2

Things not in the lexical model:

  • joshua
    • Note: in the built version I have locally, josh is an entry. (I didn't edit it in, either...)
  • i / I
  • am
  • 38

Notably, short words not in the model... naturally aren't available as reference points when doing word-breaking.

These facts combined make some behaviors a bit interesting - note how it acts when and I am (minus the spaces) is typed:

  • and | ia
  • an | diam (because "an" and "diam[ond]" starts to look like a real possibility)
  • When diam is no longer the most recent thing in context, it reverts back to and | iam - diam isn't a word, after all.

As for actual predictive-text use though... note that it's currently fairly impacted by any misspellings in recent context, possibly triggering undesiredly-early wordbreaks when typos occur. That's something that'll need to be addressed in some way.

Known issues:

  • I haven't linked in the search-term keyer yet.
    • Mapping "keyed" spans to their pre-keyed equivalents in the actual text doesn't sound trivial, though - leaving it out definitely helped with quicker prototyping.
  • The patterns for Trie model initialization and for dict-breaker specification aren't currently compatible.
    • The Trie model wants a fully-built wordbreaker at initialization time.
    • The dict-breaker wants the root LexiconTraversal from the Trie, which (currently) isn't available until the Trie model is initialized.
      • I have a pretty decent idea on how to resolve this, but it would need some work - both in the models department and in the model-compiler.

jahorton avatar Mar 11 '24 08:03 jahorton

It appears that when hosted by TC, it's serving up the .mjs script as application/octet-stream, which gets rejected by the browser. sigh

I wonder if renaming all the relevant files with .js instead would be enough...

jahorton avatar Mar 11 '24 09:03 jahorton

OK, that did the trick. TC needs any JS script to be .js; it doesn't like .mjs file-extensions.

jahorton avatar Mar 11 '24 09:03 jahorton

@mhosken @srl295 would like your feedback on this

mcdurdin avatar Mar 11 '24 14:03 mcdurdin

@mhosken @srl295 would like your feedback on this

On my todo to analyze the chain here

srl295 avatar Mar 11 '24 14:03 srl295

@mhosken is not logged into GitHub but he gave me verbal feedback that this looks good.

mcdurdin avatar Mar 11 '24 14:03 mcdurdin

Just realised this is targeting beta but needs to be targeting master -- we won't be able to merge this in during B17S4 as it's too close to release.

mcdurdin avatar Mar 11 '24 14:03 mcdurdin

Please note that I milestoned it "Future." I never intended this to land in 17.0. I'll rebase to master once the current state of beta is merged into it.

jahorton avatar Mar 11 '24 15:03 jahorton

I've set the milestone to 18.0 (unless you really don't want it to merge until 19.0 or later?)

mcdurdin avatar Mar 11 '24 15:03 mcdurdin

Just realised this is targeting beta but needs to be targeting master[...]

Done.

jahorton avatar Mar 27 '24 06:03 jahorton