keyman feat(web): Review of predictive text code and tests 🚂

Comments on study of predictive-text\worker-thread\src\main\correction prior to current changes (i.e. commit 69d16269).

Dec 02 '25 17:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not strictly necessary):

classical-calculation.ts
context-tracker.ts
distance-modeler.ts
execution-timer.ts
index.ts
segmentation-calculation.ts
tokenization-subsets.ts

Dec 02 '25 17:12 markcsinclair

Starting by looking at classical-calculation.ts.

Could we add a references to Hamming, Levenshtein and Damerau-Levenshtein in the references for ClassicalDistanceCalculation() in classical-calculation.ts? The Hamming distance isn't strictly needed, but some people will be familiar with this, and there is a progression of complexity from Hamming -> Levenshtein -> Damerau-Levenshtein.

Dec 02 '25 17:12 markcsinclair

If possible, could we add Javadoc comments to some additional functions/classes, such as:

getTrueIndex()
getCostAt()
initialCostAt()
getSubset()
addInputChar()
_addInputChar()
addMatchChar()
_addMatchChar()
increaseMaxDistance()
propagateUpdateFrom()
getTransposeParent()
findTransposeEdges()
EdgeFinder
PathBuilder
- addEdgeFinder()
- backtracePath()

Dec 02 '25 19:12 markcsinclair

As hasFinalCostWithin() is only used in tests, should it be in a source file?

Similarly, should visualizeCalculation() and the commented-out visualize() be in a source file? Perh aps they could be moved to a file in the tests folder?

Dec 02 '25 19:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Dec 02 '25 20:12 markcsinclair

It would be good to see some unit tests on:

Dec 02 '25 20:12 markcsinclair

The code relating to 'merge' (here and here) seems unused.

Why is 'split' not one of the EditOperations?

I see that there is an ExtendedEditOperation in segmentable-calculation.ts, but this is not used in classical-calculation.ts.

Dec 02 '25 20:12 markcsinclair

The local variable totalId seems potentially confusing because of the usual meaning of Id. Might something like totalInsertDelete be clearer?

Dec 02 '25 21:12 markcsinclair

Could much of the recursive backtracing of paths (PathBuilder/EdgeFinder) not be replaced by storing the operations as the edge-distance array is built? You could then build them into paths more straightforwardly.

Dec 03 '25 10:12 markcsinclair

editpath() could be replaced by a Dijstra algorithm parameterised by a function to determine the operation costs (equivalent to link lengths in the original algorithm, one for matches, one for matches + substitutions and one for all three); it would be aided by storing the operations as the edge-distance array is built. This would avoid assembling and sorting all the paths.

Dec 03 '25 10:12 markcsinclair

Are these getters ever used?

Dec 03 '25 11:12 markcsinclair

Looking at context-token.ts.

It would be good to see some unit tests on sourceRangeKey()

Dec 03 '25 12:12 markcsinclair

A Javadoc comment on toMathematicalSMP() in context-token.tests.ts would be helpful, including SMP is Supplementary Multilingual Plane

Dec 03 '25 12:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not stricktly necessary):

As in, for existing code we add it as we work on the codebase but we are not going back and adding to all source files as that's just busywork :grin:

Dec 03 '25 13:12 mcdurdin

@jahorton @mcdurdin

Design choices (code structure, algorithms, etc) should be driven not only by execution speed and memory use (although these can be crucial considerations, especially in some circumstances), but also by aiding the comprehensibility of the code, and the ease of refactoring and maintenance (both bug fixing and the addition of new features), especially by someone other than the original developer.

As well as establishing the correctness of code and supporting potential refactoring, detailed and comprehensive unit testing is useful for aiding the understanding of the source code by readers other than the original developer (plus the original developer any time after six months or so).

Dec 04 '25 09:12 markcsinclair

The code relating to 'merge' (here and here) seems unused.

Why is 'split' not one of the EditOperations?

I see that there is an ExtendedEditOperation in segmentable-calculation.ts, but this is not used in classical-calculation.ts.

Yeah, SegmentableDistanceCalculation is a subclass of the type - it adds support for both split and merge. We only wish to support those edit types under certain conditions.

Yes, I found the subclass, but in the context of the parent class, you should not be utilising the extensions unless this is done via overloaded methods (polymorphism) otherwise encapsulation is broken (causing confusion to the reader).

Dec 04 '25 17:12 jahorton

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

Dec 04 '25 17:12 jahorton

Design choices (code structure, algorithms, etc) should be driven not only by execution speed and memory use (although these can be crucial considerations, especially in some circumstances), but also by aiding the comprehensibility of the code, and the ease of refactoring and maintenance (both bug fixing and the addition of new features), especially by someone other than the original developer.

Yes, in majority of cases my preference is for code comprehensibility to be the top priority. Predictive text is one of the cases where memory/speed optimization is actually quite important - we are dealing with significant amounts of data in lower-resource environments. This is a discussion we've had a few times internally in the team, but I don't think it's made it into our written programming guides.

As well as establishing the correctness of code and supporting potential refactoring, detailed and comprehensive unit testing is useful for aiding the understanding of the source code by readers other than the original developer (plus the original developer any time after six months or so).

Unit tests for learning the code base are in my experience a mixed bag -- boilerplate repetition which is common in tests can mean details are lost in the boilerplate noise, but abstraction can hide the details even more effectively. Trying to write tidy unit tests is an art in itself (and no, AI does not help here -- at all)!

Dec 05 '25 09:12 mcdurdin

Propagating your comments into our code guide, @markcsinclair:

https://github.com/keymanapp/keyman/wiki/Keyman-Code-Style-Guide#code-maintainability

Feel free to edit further!

Dec 05 '25 09:12 mcdurdin

Returning to look at context-token.ts.

It would be good to have data field Javadoc comments on:

TokenInputSource.trueTransform
TokenInputSource.inputStartIndex
ContextToken.isPartial
ContextToken.merge() (currently an incorrect @param split is given)
ContextToken.split()

Also Javadoc header for:

preprocessInputSources()

Dec 08 '25 11:12 markcsinclair

Moving to context-state.ts.

In analyzeTransition() (ln196), why can token substitution fail for large applied suggestions?

How can transformDistrution be optional? (here)

Dec 10 '25 11:12 markcsinclair

In context-state.tests.ts, I am not sure I understand this often repeated comment:

      // // Phrased this way to facilitate TS type-inference; assert.isTrue() does
      // // NOT do this for us!
      // if(!newContextMatch.final.tokenization.alignment.canAlign) {
      //   assert.fail("context alignment failed");
      // }
      // assert.equal(newContextMatch.final.tokenization.alignment.leadTokenShift, 0);
      // assert.equal(newContextMatch.final.tokenization.alignment.tailTokenShift, 0);

Dec 10 '25 12:12 markcsinclair

In context-state.tests.ts, should determineContextSlideDeltas be determineContextSlideTransform?

Dec 10 '25 12:12 markcsinclair

Turning to context-tokenization.ts.

It would be good to have Javadoc comments on:

EditTokenMap
TokenMergeMap
TokenSplitMap
analyzePathMergesAndSplits() params
determineTaillessTrueKeystroke() param

Dec 10 '25 14:12 markcsinclair

In context-tokenization.tests.ts, could some of the repeated helper/variables (here and here) be imported from context-token.tests.ts or plced in a shared helper file?

Should testEdgeWindowSpec not be initialised to the constants in context-tokenization.ts?

It would be good to see some tests of determineTaillessTrueKeystroke().

Dec 10 '25 14:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not stricktly necessary):

As in, for existing code we add it as we work on the codebase but we are not going back and adding to all source files as that's just busywork 😁

Although, nearly all of the files I am mentioning are those touched by Joshua's subsequent PRs, so could come under the 'as we work on the codebase' clause 😉

Dec 11 '25 10:12 markcsinclair

If possible, could we add Javadoc comments to some additional functions/classes, such as:

getTrueIndex()

getCostAt()

initialCostAt()

getSubset()

addInputChar()

_addInputChar()

addMatchChar()

_addMatchChar()

increaseMaxDistance()

propagateUpdateFrom()

getTransposeParent()

findTransposeEdges()

EdgeFinder

PathBuilder

addEdgeFinder()

backtracePath()

Addressed in PR #15294

Dec 11 '25 11:12 markcsinclair

Are these getters ever used?

mapKey()

lastInputEntry()

lastMatchEntry()

Address in PR #15296

Dec 11 '25 11:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

Could we move it into the next block then (just before current ln 979), to be clearer?

Dec 11 '25 11:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

Could we move it into the next block then (just before current ln 979), to be clearer?

You find it clearer for the documentation of the if-condition to be after the condition? I personally prefer all comments about a line to be before it.

Dec 11 '25 14:12 jahorton