Tool to compare translations to make sure they are up-to-date

Open NSoiffer opened this issue 3 years ago • 11 comments

There should be a tool that compares the English version of some rules with a translated version.

When a translation is done, any text output should use the key T instead of the t used in the English version. That marks the text as translated. When the English version gets updated, a tool should be able to tell what needs updating in a translation.

Rules are identified by two keys: name and tag. A simple version of a checker would take two files: an English YAML file and the corresponding translation. It would do two things:

  1. When the name/tag pairs match, it would make sure that all "t"s have been turned into "T"s (also for the much less frequently used ot/OT and ct/CT).
  2. Note any differences in the name/tag matches. It is probably OK if the translation has a rule that the English version doesn't, but it is very likely a missing translation if the English version has a rule that the translation doesn't.

Files that start with "unicode" are slightly different. Instead of using name/tag, each entry is identified by just a string (e.g., "=": [t: "equals"]).
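
A minimal sketch of the two checks described above, assuming the rules have already been parsed into Python lists of dictionaries. The names `rule_key`, `count_untranslated`, and `audit` are hypothetical, and the sketch assumes each rule's tag is a single string.

```python
# Lowercase keys mark untranslated text (t, ot, ct per the description above).
UNTRANSLATED_KEYS = {"t", "ot", "ct"}


def rule_key(rule, is_unicode=False):
    """Identify a rule by 'name:tag', or by its string in a unicode file."""
    if is_unicode:
        # Assumed shape of a unicode entry: {"=": [{"t": "equals"}]}
        return next(iter(rule))
    return f"{rule['name']}:{rule['tag']}"


def count_untranslated(node):
    """Recursively count lowercase t/ot/ct keys inside a parsed rule."""
    if isinstance(node, dict):
        return sum(
            (1 if key in UNTRANSLATED_KEYS else 0) + count_untranslated(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return sum(count_untranslated(item) for item in node)
    return 0


def audit(english_rules, translated_rules, is_unicode=False):
    """Return (untranslated counts, rules missing from translation, extra rules)."""
    english = {rule_key(r, is_unicode): r for r in english_rules}
    translated = {rule_key(r, is_unicode): r for r in translated_rules}
    untranslated = {}
    for key, rule in translated.items():
        if key in english:
            count = count_untranslated(rule)
            if count:
                untranslated[key] = count
    missing = [key for key in english if key not in translated]
    extra = [key for key in translated if key not in english]
    return untranslated, missing, extra
```

Rules that exist only in the translation are returned separately so the caller can treat them as informational rather than as errors.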

There should be two modes to the tool:

  1. Warnings are printed when there are differences
  2. A new file is created with a comment added at each problem: # NEEDS TRANSLATION for case 1 above and #NEW RULE THAT NEEDS TRANSLATION for case 2. A summary of the number of each kind of addition is printed out.

Once such a tool is written, a shell script (or similar) should be written that takes a rule directory and uses the tool to compare every file in that directory against the English version.
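
One way the directory-level driver could look, sketched in Python rather than shell. The function names are hypothetical, and the sketch assumes the translated directory mirrors the English directory's file layout.

```python
from pathlib import Path


def pair_rule_files(english_dir, translation_dir):
    """Yield (english_file, translated_file) for each YAML file under english_dir.

    translated_file is None when the translation has no matching file.
    """
    english_dir = Path(english_dir)
    translation_dir = Path(translation_dir)
    for english_file in sorted(english_dir.rglob("*.yaml")):
        candidate = translation_dir / english_file.relative_to(english_dir)
        yield english_file, (candidate if candidate.is_file() else None)


def is_unicode_file(path):
    """Files whose name starts with 'unicode' are keyed by a string, not name/tag."""
    return Path(path).name.startswith("unicode")
```

Each pair can then be handed to the checker, and a `None` on the translated side reported as a wholly missing file.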

NSoiffer avatar Nov 19 '22 19:11 NSoiffer

Any preferred languages for the tool and/or shell environments to target?

brichwin avatar May 19 '23 18:05 brichwin

Python is a good language to write tools like this.

Despite my comment, I think a parameter is a better way to deal with the "two modes to the tool" comment at the end. The person running the code should be the one to decide how the tool should operate.

There are a number of existing tools in the PythonScripts sub-directory of MathCAT.

Python knows how to read YAML files ("import yaml"), so my thought was that you read the 'en' and other language files into separate dictionaries or lists, sort both of them, and then:

  1. report name/tag rules in one that are not in the other
  2. when they exist in both, report differences other than ones in "t"/"T" (and ot, ct) keys.
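
A rough sketch of step 2, assuming both rules are already parsed (for example by a YAML loader) into nested dicts and lists. The name `diff_rules` and the message wording are made up for illustration.

```python
# Text keys are compared case-insensitively, so t/T, ot/OT and ct/CT are all skipped.
TEXT_KEYS = {"t", "ot", "ct"}


def diff_rules(en, tr, path=""):
    """Return a list of differences between two parsed rules, ignoring text keys."""
    problems = []
    if isinstance(en, dict) and isinstance(tr, dict):
        en_keys = {k for k in en if str(k).lower() not in TEXT_KEYS}
        tr_keys = {k for k in tr if str(k).lower() not in TEXT_KEYS}
        for key in sorted(en_keys - tr_keys, key=str):
            problems.append(f"{path}[{key!r}]: only in English")
        for key in sorted(tr_keys - en_keys, key=str):
            problems.append(f"{path}[{key!r}]: only in translation")
        for key in sorted(en_keys & tr_keys, key=str):
            problems += diff_rules(en[key], tr[key], f"{path}[{key!r}]")
    elif isinstance(en, list) and isinstance(tr, list):
        if len(en) != len(tr):
            problems.append(f"{path}: list lengths differ ({len(en)} vs {len(tr)})")
        # Compare the items both lists have in common, position by position.
        for index, (en_item, tr_item) in enumerate(zip(en, tr)):
            problems += diff_rules(en_item, tr_item, f"{path}[{index}]")
    elif en != tr:
        problems.append(f"{path}: {en!r} != {tr!r}")
    return problems
```

Because the comparison runs on parsed data, comments and formatting never show up as differences; only structure and non-text values do.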

It is really useful to keep the order in both files, so probably a second pass is needed when writing out new rules or adding a comment about a change. That second pass would be textual and it would just look for the name/tag with a simple string comparison as it linearly reads through the file.
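
The textual second pass could be sketched like this. It assumes each rule starts with a `- name: ...` line followed by a `tag: ...` line holding a single unquoted tag; the function name and regular expressions are illustrative only.

```python
import re

NAME_RE = re.compile(r"^\s*-\s*name:\s*(\S+)")
TAG_RE = re.compile(r"^\s*tag:\s*(\S+)")


def annotate(lines, flagged, comment="# NEEDS TRANSLATION"):
    """Return a copy of lines with `comment` inserted above each flagged rule.

    `lines` are file lines without trailing newlines; `flagged` is a set of
    'name:tag' strings found by the first (parsed) pass.
    """
    out = []
    pending_name = None  # name seen, still waiting for its tag line
    name_index = None    # position in `out` of the pending '- name:' line
    for line in lines:
        name_match = NAME_RE.match(line)
        if name_match:
            pending_name, name_index = name_match.group(1), len(out)
        else:
            tag_match = TAG_RE.match(line)
            if tag_match and pending_name is not None:
                if f"{pending_name}:{tag_match.group(1)}" in flagged:
                    name_line = out[name_index]
                    indent = name_line[: len(name_line) - len(name_line.lstrip())]
                    out.insert(name_index, indent + comment)
                pending_name = None
        out.append(line)
    return out
```

Since the pass only inserts lines, the original order, comments, and spacing of the translated file are preserved.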

Does that make any sense?

NSoiffer avatar May 19 '23 19:05 NSoiffer

Makes sense. Working on it...

brichwin avatar May 25 '23 01:05 brichwin

Python knows how to read yaml files ("import yaml") ... 2. when they exist in both, report differences other than ones in "t"/"T" (and ot, ct) keys.

I am assuming it's not important to report differences in comments in the rules? There appear to be comments in some of the translated rules that are specific to documenting the translation.

brichwin avatar May 25 '23 02:05 brichwin

Please see issue #117, "MathCAT/Rules/Languages/en/unicode-full.yaml contains multiple duplicate entries & incorrect entries".

I discovered this while running the tool I created while working on this issue. Please let me know if duplicate Unicode chars are a real problem so I know how to handle Unicode translation files. If dups are valid then I need to disambiguate their keys in the python dictionary that tracks them.
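
For reference, duplicates can be detected textually before the entries ever reach a dictionary (where a later key would silently replace an earlier one). This is only a sketch: it assumes every entry begins on its own line as `- "<chars>":` with a double-quoted key, and the names are hypothetical.

```python
import re
from collections import Counter

# Matches the start of an entry such as:  - "=": [t: "equals"]
ENTRY_RE = re.compile(r'^\s*-\s*"((?:[^"\\]|\\.)+)"\s*:')


def find_duplicate_keys(lines):
    """Return {key: occurrences} for every key defined more than once."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return {key: n for key, n in counts.items() if n > 1}
```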

brichwin avatar May 26 '23 21:05 brichwin

Q: Many of the YAML rules files contain tabs (to align comments). Technically, tabs cannot be used for indentation in YAML files, and the Python libraries for processing YAML files that I've tried abort when they find tabs.

I'm guessing this is not a problem with how MathCAT is parsing those YAML files?

Should I ignore the tabs and simply replace them in the processing or treat them as a critical error and abort?

brichwin avatar Jun 28 '23 16:06 brichwin

The Rust YAML library I'm using doesn't care. However, it would be best to replace them all with spaces.

Do you want to fix them or would you rather I fix them?

FYI: if you are not already doing so when doing the comparison, you should ignore the comments (on separate lines or to the right of code).

NSoiffer avatar Jun 29 '23 01:06 NSoiffer

I'd be happy to convert all of the tabs to spaces in the Rules .yaml files. I forked the repo, switched to a new branch, and committed the updated YAML files along with a Python script that converts tabs to spaces (in case the task needs to be performed again in the future). You should have a pull request.
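
The conversion itself can be quite small. The following is a sketch, not the committed script: the function name and the tab size of 4 are assumptions, and it relies on Python's built-in `str.expandtabs`, which pads each tab to the next tab stop so that aligned comments stay aligned.

```python
def expand_tabs_in_file(path, tab_size=4):
    """Replace tabs with spaces in place; return True if the file changed."""
    # newline="" disables newline translation so line endings are preserved.
    with open(path, encoding="utf-8", newline="") as handle:
        original = handle.read()
    converted = original.expandtabs(tab_size)
    if converted == original:
        return False
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(converted)
    return True
```

Note that this also expands any tab that sits inside a quoted string, so files with literal tabs in rule text would need a closer look.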

brichwin avatar Jun 29 '23 01:06 brichwin

Please let me know if this is what you are thinking. Given a set of truncated test files contained in the uploaded original_yaml_test_files zip file:

  • en/test.yaml
  • en/unicode-test.yaml
  • es/test.yaml
  • es/unicode-test.yaml

They generated the following files contained in the resulting_yaml_test_files zip file:

  • es/test.yaml (updated file)
  • es/test.yaml.bak (backup file)
  • es/unicode-test.yaml (updated file)
  • es/unicode-test.yaml.bak (backup file)

The program has the following usage:

$ python ../../PythonScripts/audit_rule_translations.py -h
usage: audit_rule_translations.py [-h] [--mode {warnings,new_version}]
                                  [--unicode {true,false,auto}]
                                  english_rules translated_rules

Audit a translated rules file against its English version.

positional arguments:
  english_rules         The English version of the rules YAML file.
  translated_rules      The translated version of the rules YAML file.

options:
  -h, --help            show this help message and exit
  --mode {warnings,new_version}
                        In 'warnings' mode (default), differences between
                        files are listed as warnings. In 'new_version' mode, a
                        new version of the translated file is created with
                        comments where translation is needed.
  --unicode {true,false,auto}
                        Use 'true' to force handling the file as a unicode
                        definitions yaml file. Use 'false' to force handling
                        the file as a non-unicode definitions yaml file. Use
                        'auto' mode (default) to automatically detect if the 
                        file is a unicode definitions yaml file by inspecting
                        the filename for 'unicode'.
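
For readers who want to see how such an interface maps onto argparse, here is a sketch reconstructed only from the help text above, not taken from the actual script. The helper `resolve_unicode_mode` is hypothetical, and checking the English file's name (rather than the translated one) is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the positional arguments and options shown in the help."""
    parser = argparse.ArgumentParser(
        description="Audit a translated rules file against its English version.")
    parser.add_argument("english_rules",
                        help="The English version of the rules YAML file.")
    parser.add_argument("translated_rules",
                        help="The translated version of the rules YAML file.")
    parser.add_argument("--mode", choices=["warnings", "new_version"],
                        default="warnings")
    parser.add_argument("--unicode", choices=["true", "false", "auto"],
                        default="auto")
    return parser


def resolve_unicode_mode(flag, english_rules):
    """Turn the --unicode flag into a boolean; 'auto' inspects the filename."""
    if flag == "auto":
        return "unicode" in Path(english_rules).name
    return flag == "true"
```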

Output for --mode warnings (files are not updated) is:

$ python ../../PythonScripts/audit_rule_translations.py en/test.yaml es/test.yaml

Processing 10 items in Non-Unicode mode.

Rule 'into-or-out-of:msubsup' still contains 3 key(s) needing translating.
Rule 'into-or-out-of:mfrac' is missing in the translated file.
Rule 'into-or-out-of:mover' is missing in the translated file.
Rule 'into-or-out-of:mmultiscripts' is missing in the translated file.
Warning: Rule 'into-or-out-of:msup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: match don't match at path: into-or-out-of:msup['match']:
  -   1: $Move2D != ''
  -   2: $Move3D != ''
Warning: Rule 'into-or-out-of:msubsup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: if don't match at path: into-or-out-of:msubsup['replace'][1]['test'][0]['if']:
  -   1: count($Child2D/preceding-sibling::*)=0
  -   2: count($Child2D/following-sibling::*)=0

and

$ python ../../PythonScripts/audit_rule_translations.py en/unicode-test.yaml es/unicode-test.yaml

Processing 5 items in Unicode mode.

Rule 'B-Z' still contains 1 key(s) needing translating.
Rule '0-9' still contains 1 key(s) needing translating.
Rule '!' (Unicode char: \u0021) still contains 1 key(s) needing translating.
Rule 'a' (Unicode char: \u0061) is missing in the translated file.
Rule '(' (Unicode char: \u0028) is missing in the translated file.
Warning: Rule 'B-Z' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: value don't match at path: B-Z['B-Z'][2]['pitch']['value']:
  -   1: $CapitalLetters_Pitches
  -   2: $CapitalLetters_Pitch
Warning: Rule '!' (Unicode char: \u0021) contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Dictionaries don't have the same keys at path: !['!'][0]
  - Keys in first dictionary that are not in second dictionary: {'test'}

Output for --mode new_version (translation file is updated with comments if necessary) is:

$ python ../../PythonScripts/audit_rule_translations.py --mode new_version en/test.yaml es/test.yaml

Processing 10 items in Non-Unicode mode.

Rule 'into-or-out-of:msubsup' still contains 3 key(s) needing translating.
Rule 'into-or-out-of:mfrac' is missing in the translated file.
Rule 'into-or-out-of:mover' is missing in the translated file.
Rule 'into-or-out-of:mmultiscripts' is missing in the translated file.
Warning: Rule 'into-or-out-of:msup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: match don't match at path: into-or-out-of:msup['match']:
  -   1: $Move2D != ''
  -   2: $Move3D != ''
Warning: Rule 'into-or-out-of:msubsup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: if don't match at path: into-or-out-of:msubsup['replace'][1]['test'][0]['if']:
  -   1: count($Child2D/preceding-sibling::*)=0
  -   2: count($Child2D/following-sibling::*)=0

Creating new version of translated file with comments where translation is needed.
Missing rules:
  into-or-out-of:mfrac is missing after None
  into-or-out-of:mover is missing after into-or-out-of:munder
  into-or-out-of:mmultiscripts is missing after into-or-out-of:munderover

New version of es/test.yaml created. Original backed up to es/test.yaml.bak.
  3 new rule(s) that need translation.
  1 rule(s) that need translation of keys.
  2 rule(s) with differences other than translation.

and

$ python ../../PythonScripts/audit_rule_translations.py --mode new_version en/unicode-test.yaml es/unicode-test.yaml

Processing 5 items in Unicode mode.

Rule 'B-Z' still contains 1 key(s) needing translating.
Rule '0-9' still contains 1 key(s) needing translating.
Rule '!' (Unicode char: \u0021) still contains 1 key(s) needing translating.
Rule 'a' (Unicode char: \u0061) is missing in the translated file.
Rule '(' (Unicode char: \u0028) is missing in the translated file.
Warning: Rule 'B-Z' contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Values for key: value don't match at path: B-Z['B-Z'][2]['pitch']['value']:
  -   1: $CapitalLetters_Pitches
  -   2: $CapitalLetters_Pitch
Warning: Rule '!' (Unicode char: \u0021) contains differences other than ones in t, T, OC, oc, CT, ct keys:
  - Dictionaries don't have the same keys at path: !['!'][0]
  - Keys in first dictionary that are not in second dictionary: {'test'}

Creating new version of translated file with comments where translation is needed.
Missing rules:
  a is missing after None
  ( is missing after B-Z

New version of es/unicode-test.yaml created. Original backed up to es/unicode-test.yaml.bak.
  2 new rule(s) that need translation.
  3 rule(s) that need translation of keys.
  2 rule(s) with differences other than translation.

original_yaml_test_files.zip

resulting_yaml_test_files.zip

brichwin avatar Jun 29 '23 02:06 brichwin

If you want to inspect the python script for auditing the rule translations, it can be found here: https://github.com/brichwin/MathCAT/blob/audit_rule_translations/PythonScripts/audit_rule_translations.py

I'm going to work up a few more tests, but please let me know if it is close to what you want.

brichwin avatar Jul 03 '23 14:07 brichwin