feat(web): dictionary-based wordbreaking - draft form with demo page 📚
This PR currently represents an effort to resolve #5025. It's a significant part of the way there, but there's notably more to do before we can call the effort complete. I actually have a lot of intermediate branches "ready to drop" if and when we want to do a proper review, etc of these changes.
Some of these have now been spun off into their own PRs:
- https://github.com/keymanapp/keyman/pull/11867
- https://github.com/keymanapp/keyman/pull/11868
- https://github.com/keymanapp/keyman/pull/11869
- https://github.com/keymanapp/keyman/pull/11870
That said, I still have four more distinct branches comprising this one PR.
- One for the actual, initial implementation
- One for establishing a unit test; we need the ability to draw in and read from a compiled Trie to use as wordbreaker's dictionary
- One to handle parts of the context not found in the dictionary
- And finally, one to build the demo page seen in the examples below.
Known tasks that would still need to be done:
-
The new wordbreaker cannot yet be specified for use by any Developer-compiled lexical model
- This is not straight-forward - our existing Trie model pattern requires that the wordbreaker be built first... but this wordbreaker requires a pre-built Trie to be provided to its constructor.
- Current thought: we could consider building the core Trie first, then providing it separately to both the TrieModel and the wordbreaker.
-
If a typo occurs while typing a long word that has a prefix which is a valid shorter word, sometimes the wordbreaker will auto-break at the end of that prefix, meaning that we can no longer correct as expected.
- Suppose when typing
basketball, we typedbasker.baskis a perfectly valid English word.baskeris technically legal English but is likely quite rare. - If
baskeris not available, we then end up withbask | er, thus with corrections and predictions based onerrather thanbasker. - So... how to mitigate and/or work around this issue?
- Suppose when typing
This specific branch is devoted to the demo / testing-host page that supports a state in which the wordbreaker is usable outside of the worker, able to report the state of wordbreaking as context is edited. I aimed to create something like the Developer JS-keyboard test-host has for character map viewing. To do so means building some extra JS bundles in order to support it, and I'm not sure if these specific changes (for the specialized page) should ever fully land.
Anyway, about the page:
Test page link - currently bottom of the testing-index's second section, labeled "Demo / test-harness for dictionary-based wordbreaking". (Link matches commit https://github.com/keymanapp/keyman/pull/10973/commits/9b580aa161b0e0af33b566182899d23f1537353b.)
The test page allows retrieving any Keyman keyboard from the cloud while retrieving the lexical model file locally via file-selector. The lexical model will then be 'lightly hacked' - its backing data will be retrieved (via known private field) and fed to the prototype dict wordbreaker, regardless of whatever the model's actual wordbreaking setting is. (Note: no model is pre-loaded for the page.) Predictive-text itself is not activated, though.
This page should help make assessment easier, allowing us to experiment with how it operates with the data from any existing lexical model. I've already tested it out with our English (MTNT) model quite a bit and found the results fairly enlightening; they've already led to a bug fix or two.
A few sample runs, using English:
But how well does it handle stuff not in the lexical-model?
Things not in the lexical model:
joshua- Note: in the built version I have locally,
joshis an entry. (I didn't edit it in, either...)
- Note: in the built version I have locally,
i/Iam38
Notably, short words not in the model... naturally aren't available as reference points when doing word-breaking.
These facts combined make some behaviors a bit interesting - note how it acts when and I am (minus the spaces) is typed:
and | iaan | diam(because "an" and "diam[ond]" starts to look like a real possibility)- When
diamis no longer the most recent thing in context, it reverts back toand | iam-diamisn't a word, after all.
As for actual predictive-text use though... note that it's currently fairly impacted by any misspellings in recent context, possibly triggering undesiredly-early wordbreaks when typos occur. That's something that'll need to be addressed in some way.
Known issues:
- I haven't linked in the search-term keyer yet.
- Mapping "keyed" spans to their pre-keyed equivalents in the actual text doesn't sound trivial, though - leaving it out definitely helped with quicker prototyping.
- The patterns for Trie model initialization and for dict-breaker specification aren't currently compatible.
- The Trie model wants a fully-built wordbreaker at initialization time.
- The dict-breaker wants the root
LexiconTraversalfrom the Trie, which (currently) isn't available until the Trie model is initialized.- I have a pretty decent idea on how to resolve this, but it would need some work - both in the models department and in the model-compiler.
User Test Results
Test specification and instructions
ERROR: user tests have not yet been defined
Test Artifacts
- Android
- Developer
- iOS
- Keyboards
- Linux
- macOS
- Web
- Windows
It appears that when hosted by TC, it's serving up the .mjs script as application/octet-stream, which gets rejected by the browser. sigh
I wonder if renaming all the relevant files with .js instead would be enough...
OK, that did the trick. TC needs any JS script to be .js; it doesn't like .mjs file-extensions.
@mhosken @srl295 would like your feedback on this
@mhosken @srl295 would like your feedback on this
On my todo to analyze the chain here
@mhosken is not logged into GitHub but he gave me verbal feedback that this looks good.
Just realised this is targeting beta but needs to be targeting master -- we won't be able to merge this in during B17S4 as it's too close to release.
Please note that I milestoned it "Future." I never intended this to land in 17.0. I'll rebase to master once the current state of beta is merged into it.
I've set the milestone to 18.0 (unless you really don't want it to merge until 19.0 or later?)
Just realised this is targeting beta but needs to be targeting master[...]
Done.