keyman icon indicating copy to clipboard operation
keyman copied to clipboard

feat(web): Review of predictive text code and tests 🚂

Open markcsinclair opened this issue 3 weeks ago • 31 comments

Comments on study of predictive-text\worker-thread\src\main\correction prior to current changes (i.e. commit 69d16269).

markcsinclair avatar Dec 02 '25 17:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not strictly necessary):

  • classical-calculation.ts
  • context-tracker.ts
  • distance-modeler.ts
  • execution-timer.ts
  • index.ts
  • segmentation-calculation.ts
  • tokenization-subsets.ts

markcsinclair avatar Dec 02 '25 17:12 markcsinclair

Starting by looking at classical-calculation.ts.

Could we add a references to Hamming, Levenshtein and Damerau-Levenshtein in the references for ClassicalDistanceCalculation() in classical-calculation.ts? The Hamming distance isn't strictly needed, but some people will be familiar with this, and there is a progression of complexity from Hamming -> Levenshtein -> Damerau-Levenshtein.

markcsinclair avatar Dec 02 '25 17:12 markcsinclair

If possible, could we add Javadoc comments to some additional functions/classes, such as:

markcsinclair avatar Dec 02 '25 19:12 markcsinclair

As hasFinalCostWithin() is only used in tests, should it be in a source file?

Similarly, should visualizeCalculation() and the commented-out visualize() be in a source file? Perh aps they could be moved to a file in the tests folder?

markcsinclair avatar Dec 02 '25 19:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

markcsinclair avatar Dec 02 '25 20:12 markcsinclair

The code relating to 'merge' (here and here) seems unused.

Why is 'split' not one of the EditOperations?

I see that there is an ExtendedEditOperation in segmentable-calculation.ts, but this is not used in classical-calculation.ts.

markcsinclair avatar Dec 02 '25 20:12 markcsinclair

The local variable totalId seems potentially confusing because of the usual meaning of Id. Might something like totalInsertDelete be clearer?

markcsinclair avatar Dec 02 '25 21:12 markcsinclair

Could much of the recursive backtracing of paths (PathBuilder/EdgeFinder) not be replaced by storing the operations as the edge-distance array is built? You could then build them into paths more straightforwardly.

markcsinclair avatar Dec 03 '25 10:12 markcsinclair

editpath() could be replaced by a Dijstra algorithm parameterised by a function to determine the operation costs (equivalent to link lengths in the original algorithm, one for matches, one for matches + substitutions and one for all three); it would be aided by storing the operations as the edge-distance array is built. This would avoid assembling and sorting all the paths.

markcsinclair avatar Dec 03 '25 10:12 markcsinclair

Are these getters ever used?

markcsinclair avatar Dec 03 '25 11:12 markcsinclair

Looking at context-token.ts.

It would be good to see some unit tests on sourceRangeKey()

markcsinclair avatar Dec 03 '25 12:12 markcsinclair

A Javadoc comment on toMathematicalSMP() in context-token.tests.ts would be helpful, including SMP is Supplementary Multilingual Plane

markcsinclair avatar Dec 03 '25 12:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not stricktly necessary):

As in, for existing code we add it as we work on the codebase but we are not going back and adding to all source files as that's just busywork :grin:

mcdurdin avatar Dec 03 '25 13:12 mcdurdin

@jahorton @mcdurdin

Design choices (code structure, algorithms, etc) should be driven not only by execution speed and memory use (although these can be crucial considerations, especially in some circumstances), but also by aiding the comprehensibility of the code, and the ease of refactoring and maintenance (both bug fixing and the addition of new features), especially by someone other than the original developer.

As well as establishing the correctness of code and supporting potential refactoring, detailed and comprehensive unit testing is useful for aiding the understanding of the source code by readers other than the original developer (plus the original developer any time after six months or so).

markcsinclair avatar Dec 04 '25 09:12 markcsinclair

The code relating to 'merge' (here and here) seems unused.

Why is 'split' not one of the EditOperations?

I see that there is an ExtendedEditOperation in segmentable-calculation.ts, but this is not used in classical-calculation.ts.

Yeah, SegmentableDistanceCalculation is a subclass of the type - it adds support for both split and merge. We only wish to support those edit types under certain conditions.

Yes, I found the subclass, but in the context of the parent class, you should not be utilising the extensions unless this is done via overloaded methods (polymorphism) otherwise encapsulation is broken (causing confusion to the reader).

jahorton avatar Dec 04 '25 17:12 jahorton

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

jahorton avatar Dec 04 '25 17:12 jahorton

Design choices (code structure, algorithms, etc) should be driven not only by execution speed and memory use (although these can be crucial considerations, especially in some circumstances), but also by aiding the comprehensibility of the code, and the ease of refactoring and maintenance (both bug fixing and the addition of new features), especially by someone other than the original developer.

Yes, in majority of cases my preference is for code comprehensibility to be the top priority. Predictive text is one of the cases where memory/speed optimization is actually quite important - we are dealing with significant amounts of data in lower-resource environments. This is a discussion we've had a few times internally in the team, but I don't think it's made it into our written programming guides.

As well as establishing the correctness of code and supporting potential refactoring, detailed and comprehensive unit testing is useful for aiding the understanding of the source code by readers other than the original developer (plus the original developer any time after six months or so).

Unit tests for learning the code base are in my experience a mixed bag -- boilerplate repetition which is common in tests can mean details are lost in the boilerplate noise, but abstraction can hide the details even more effectively. Trying to write tidy unit tests is an art in itself (and no, AI does not help here -- at all)!

mcdurdin avatar Dec 05 '25 09:12 mcdurdin

Propagating your comments into our code guide, @markcsinclair:

https://github.com/keymanapp/keyman/wiki/Keyman-Code-Style-Guide#code-maintainability

Feel free to edit further!

mcdurdin avatar Dec 05 '25 09:12 mcdurdin

Returning to look at context-token.ts.

It would be good to have data field Javadoc comments on:

Also Javadoc header for:

markcsinclair avatar Dec 08 '25 11:12 markcsinclair

Moving to context-state.ts.

In analyzeTransition() (ln196), why can token substitution fail for large applied suggestions?

How can transformDistrution be optional? (here)

markcsinclair avatar Dec 10 '25 11:12 markcsinclair

In context-state.tests.ts, I am not sure I understand this often repeated comment:

      // // Phrased this way to facilitate TS type-inference; assert.isTrue() does
      // // NOT do this for us!
      // if(!newContextMatch.final.tokenization.alignment.canAlign) {
      //   assert.fail("context alignment failed");
      // }
      // assert.equal(newContextMatch.final.tokenization.alignment.leadTokenShift, 0);
      // assert.equal(newContextMatch.final.tokenization.alignment.tailTokenShift, 0);

markcsinclair avatar Dec 10 '25 12:12 markcsinclair

In context-state.tests.ts, should determineContextSlideDeltas be determineContextSlideTransform?

markcsinclair avatar Dec 10 '25 12:12 markcsinclair

Turning to context-tokenization.ts.

It would be good to have Javadoc comments on:

markcsinclair avatar Dec 10 '25 14:12 markcsinclair

In context-tokenization.tests.ts, could some of the repeated helper/variables (here and here) be imported from context-token.tests.ts or plced in a shared helper file?

Should testEdgeWindowSpec not be initialised to the constants in context-tokenization.ts?

It would be good to see some tests of determineTaillessTrueKeystroke().

markcsinclair avatar Dec 10 '25 14:12 markcsinclair

It would be good to have an SIL boilerplate header on all source files (@mcdurdin says not stricktly necessary):

As in, for existing code we add it as we work on the codebase but we are not going back and adding to all source files as that's just busywork 😁

Although, nearly all of the files I am mentioning are those touched by Joshua's subsequent PRs, so could come under the 'as we work on the codebase' clause 😉

markcsinclair avatar Dec 11 '25 10:12 markcsinclair

If possible, could we add Javadoc comments to some additional functions/classes, such as:

Addressed in PR #15294

markcsinclair avatar Dec 11 '25 11:12 markcsinclair

Are these getters ever used?

Address in PR #15296

markcsinclair avatar Dec 11 '25 11:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

Could we move it into the next block then (just before current ln 979), to be clearer?

markcsinclair avatar Dec 11 '25 11:12 markcsinclair

Can you give some extra explanation for validate the match? Is this just the check on line 974, or the input == match on 978?

Line 978 - the lines immediately following it.

Could we move it into the next block then (just before current ln 979), to be clearer?

You find it clearer for the documentation of the if-condition to be after the condition? I personally prefer all comments about a line to be before it.

jahorton avatar Dec 11 '25 14:12 jahorton