Tool to compare translations to make sure they are up-to-date
There should be a tool that compares the English version of some rules with a translated version.
When a translation is done, any text output should use the key T, as opposed to the English version, which uses t. That marks the text as having been translated. When the English version gets updated, a tool should be able to tell what needs updating in a translation.
Rules are identified by two keys: name and tag. A simple version of a checker would take two files: an English YAML file and the corresponding translation. It would do two things:
- When the name/tag pairs match, it would make sure that all "t"s have been turned into "T"s (also for the much less frequently used ot/OT and ct/CT).
- Note any differences in the name/tag matches. It is probably OK if the translation has a rule that the English version doesn't, but it is very likely a missing translation if the English version has a rule that the translation doesn't.
Files that start with "unicode" are slightly different. Instead of using name/tag, each entry is identified by just a string (e.g., "=": [t: "equals"]).
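A minimal sketch of the first check, assuming the YAML has already been parsed (for example with yaml.safe_load) into lists of dictionaries. The function names and the sample rules are hypothetical and only illustrate the idea of finding text keys that are still lowercase:

```python
from typing import Any, Iterator

# Lowercase keys mark untranslated text; uppercase (T/OT/CT) marks translated text.
UNTRANSLATED_KEYS = {"t", "ot", "ct"}


def rule_key(rule: dict, is_unicode: bool) -> str:
    """Return a rule's identity: 'name:tag', or the character key for unicode files."""
    if is_unicode:
        # Assumed shape of a unicode entry: {"=": [{"t": "equals"}]}
        return next(iter(rule))
    return f"{rule['name']}:{rule['tag']}"


def untranslated(node: Any) -> Iterator[str]:
    """Yield every value that is still stored under a lowercase t/ot/ct key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in UNTRANSLATED_KEYS:
                yield str(value)
            else:
                yield from untranslated(value)
    elif isinstance(node, list):
        for item in node:
            yield from untranslated(item)


# Hypothetical translated rule, shaped like parsed YAML.
spanish = [
    {
        "name": "default",
        "tag": "mfrac",
        "replace": [{"T": "fracción"}, {"x": "*[1]"}, {"t": "over"}],
    }
]

for rule in spanish:
    leftovers = list(untranslated(rule))
    if leftovers:
        print(f"Rule '{rule_key(rule, False)}' still has {len(leftovers)} untranslated key(s)")
```

The recursion is there because text keys can sit at any depth inside a rule's replace/test structure.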
There should be two modes to the tool:
- Warnings are printed when there are differences
- A new file is created in which a comment is added: # NEEDS TRANSLATION for case 1 and # NEW RULE THAT NEEDS TRANSLATION for case 2 above. A summary of the number of each addition made is printed out.
Once such a tool is written, a shell script (or similar) should be written that takes a rule directory and uses the tool to compare every file in that directory against the English version.
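As a rough sketch of that wrapper, assuming both language directories use the same file names, the pairing step could look like this. The function name pair_rule_files and the throwaway directory layout are made up for illustration:

```python
import tempfile
from pathlib import Path


def pair_rule_files(english_dir: Path, translated_dir: Path):
    """Pair each English YAML file with the same-named file in the translation.

    Returns (pairs, missing): pairs holds (english, translated) paths, and
    missing lists English files that have no counterpart at all.
    """
    pairs, missing = [], []
    for english_file in sorted(english_dir.rglob("*.yaml")):
        relative = english_file.relative_to(english_dir)
        translated_file = translated_dir / relative
        if translated_file.is_file():
            pairs.append((english_file, translated_file))
        else:
            missing.append(relative)
    return pairs, missing


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("en/test.yaml", "en/unicode-test.yaml", "es/test.yaml"):
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("---\n", encoding="utf-8")
    pairs, missing = pair_rule_files(root / "en", root / "es")
    for english_file, translated_file in pairs:
        # A real script would run the comparison tool on each pair here.
        print(f"compare {english_file.name} with {translated_file.name}")
    for relative in missing:
        print(f"no translation found for {relative}")
```

Reporting whole files that are missing separately keeps the per-rule output from being flooded when a language has not started on a file yet.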
Any preferred languages for the tool and/or shell environments to target?
Python is a good language to write tools like this.
Despite my comment, I think a parameter is a better way to deal with the "two modes to the tool" comment at the end. The person running the code should be the one to decide how the tool should operate.
There are a number of existing tools in the PythonScripts sub-directory of MathCAT.
Python knows how to read yaml files ("import yaml"), so my thought was that you read the 'en' and other language files into separate dictionaries or lists, sort both of them, and then
- report name/tag rules in one that are not in the other
- when they exist in both, report differences other than ones in "t"/"T" (and ot, ct) keys.
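The second bullet could be sketched as a recursive comparison that treats T/OT/CT as equivalent to t/ot/ct and skips their values. This is only an illustration under the assumption that both rules are already parsed into dicts and lists; the function names are hypothetical:

```python
from typing import Any, List

TEXT_KEYS = {"t", "ot", "ct"}


def normalize_key(key: Any) -> Any:
    """Map T/OT/CT onto t/ot/ct so translated text is not a structural change."""
    if isinstance(key, str) and key.lower() in TEXT_KEYS:
        return key.lower()
    return key


def diff(english: Any, translated: Any, path: str = "") -> List[str]:
    """Return descriptions of differences, ignoring the values of text keys."""
    where = path or "<root>"
    problems: List[str] = []
    if isinstance(english, dict) and isinstance(translated, dict):
        en = {normalize_key(k): v for k, v in english.items()}
        tr = {normalize_key(k): v for k, v in translated.items()}
        only_en = sorted(set(en) - set(tr), key=str)
        only_tr = sorted(set(tr) - set(en), key=str)
        if only_en:
            problems.append(f"{where}: keys only in English: {only_en}")
        if only_tr:
            problems.append(f"{where}: keys only in translation: {only_tr}")
        for key in en:
            # Text values are expected to differ, so they are not compared.
            if key in tr and key not in TEXT_KEYS:
                problems.extend(diff(en[key], tr[key], f"{path}[{key!r}]"))
    elif isinstance(english, list) and isinstance(translated, list):
        if len(english) != len(translated):
            problems.append(
                f"{where}: list lengths differ ({len(english)} vs {len(translated)})"
            )
        for index, (a, b) in enumerate(zip(english, translated)):
            problems.extend(diff(a, b, f"{path}[{index}]"))
    elif english != translated:
        problems.append(f"{where}: {english!r} != {translated!r}")
    return problems


# Hypothetical rule pair: the text differs (fine) and so does 'match' (reported).
english_rule = {"name": "x", "tag": "msup", "match": "$Move2D != ''", "replace": [{"t": "in"}]}
spanish_rule = {"name": "x", "tag": "msup", "match": "$Move3D != ''", "replace": [{"T": "en"}]}
for problem in diff(english_rule, spanish_rule):
    print(problem)
```

Carrying the path along makes each report point at the exact spot inside the rule.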
It is really useful to keep the order in both files, so probably a second pass is needed when writing out new rules or adding a comment about a change. That second pass would be textual and it would just look for the name/tag with a simple string comparison as it linearly reads through the file.
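A sketch of that textual second pass, assuming each rule starts with a "- name:" line and that its "tag:" line follows immediately. Both the layout assumption and the function name annotate are illustrative, not taken from the rule files:

```python
import re
from typing import List, Set

# Assumed layout: "- name: X" directly followed by "tag: Y" on the next line.
NAME_LINE = re.compile(r"^(\s*)-\s*name:\s*(\S+)")
TAG_LINE = re.compile(r"^\s*tag:\s*(.+?)\s*(#.*)?$")


def annotate(lines: List[str], needs_translation: Set[str]) -> List[str]:
    """Return a copy of lines with a comment inserted above each flagged rule."""
    out: List[str] = []
    for index, line in enumerate(lines):
        name = NAME_LINE.match(line)
        if name and index + 1 < len(lines):
            tag = TAG_LINE.match(lines[index + 1])
            if tag and f"{name.group(2)}:{tag.group(1)}" in needs_translation:
                # Reuse the rule's indentation so the comment lines up with it.
                out.append(f"{name.group(1)}# NEEDS TRANSLATION")
        out.append(line)
    return out


# Hypothetical translated file contents.
text = """- name: default
  tag: mfrac
  replace:
  - t: "over"
- name: default
  tag: msup
  replace:
  - T: "elevado"
"""

print("\n".join(annotate(text.splitlines(), {"default:mfrac"})))
```

Because the pass only inserts lines and never rewrites existing ones, the original order and the translator's own comments are left untouched.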
Does that make any sense?
Makes sense. Working on it...
Python knows how to read yaml files ("import yaml") ... 2. when they exist in both, report differences other than ones in "t"/"T" (and ot, ct) keys.
I am assuming it's not important to report differences in comments in the rules? There appear to be comments in some of the translated rules that are specific to documenting the translation.
Please see issue MathCAT/Rules/Languages/en /unicode-full.yaml contains multiple duplicate entries & incorrect entries #117
I discovered this while running the tool I created while working on this issue. Please let me know if duplicate Unicode chars are a real problem so I know how to handle Unicode translation files. If dups are valid then I need to disambiguate their keys in the python dictionary that tracks them.
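For reference, a small sketch of how duplicates could be spotted before the entries go into a dictionary. It assumes a unicode file parses to a list of single-key mappings, as the "=": [t: "equals"] example suggests; the function name and sample data are made up:

```python
from collections import Counter
from typing import Dict, List


def duplicate_keys(entries: List[dict]) -> Dict[str, int]:
    """Return each character key that appears more than once, with its count."""
    counts = Counter(next(iter(entry)) for entry in entries)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical parsed entries from a unicode definitions file.
entries = [
    {"=": [{"t": "equals"}]},
    {"!": [{"t": "factorial"}]},
    {"=": [{"t": "is equal to"}]},
]

for key, n in duplicate_keys(entries).items():
    print(f"'{key}' (Unicode char: \\u{ord(key[0]):04x}) is defined {n} times")
```

Counting on the list keeps every occurrence visible, whereas loading the entries straight into one dictionary would leave only a single definition per key.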
Q: Many of the YAML rules files contain tabs (to align comments):
Technically, tabs cannot be used in YAML files for indentation and the Python libraries for processing yaml files I've tried abort when finding tabs:
I'm guessing this is not a problem with how MathCAT is parsing those YAML files?
Should I ignore the tabs and simply replace them in the processing or treat them as a critical error and abort?
The Rust YAML library I'm using doesn't care. However, it would be best to make sure that all of the whitespace is spaces.
Do you want to fix them or would you rather I fix them?
FYI: if you are not already doing so when doing the comparison, you should ignore the comments (on separate lines or to the right of code).
I'd be happy to convert all of the tabs to spaces in the Rules .yaml files. I forked the repo, switched to a new branch, and committed a Python script that converts tabs to spaces (in case the task needs to be performed again in the future) along with the updated YAML files. You should have a pull request.
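A conversion of this kind can be sketched with str.expandtabs, which pads each tab out to the next tab stop so aligned comments stay aligned. This is not the committed script; the function name and the tab size of 4 are assumptions, and the sketch also rewrites tabs that sit inside quoted strings:

```python
import tempfile
from pathlib import Path


def convert_file(path: Path, tab_size: int = 4) -> bool:
    """Replace tabs with spaces in one file; return True if the file changed."""
    original = path.read_text(encoding="utf-8")
    converted = original.expandtabs(tab_size)
    if converted == original:
        return False
    path.write_text(converted, encoding="utf-8")
    return True


# Demonstration on a temporary file with a tab before the comment.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.yaml"
    sample.write_text("a:\t# comment\n", encoding="utf-8")
    changed = convert_file(sample)
    result = sample.read_text(encoding="utf-8")
    print(changed, repr(result))
```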
Please let me know if this is what you are thinking. Given a set of truncated test files contained in the uploaded original_yaml_test_files zip file:
- en/test.yaml
- en/unicode-test.yaml
- es/test.yaml
- es/unicode-test.yaml
They generated the following files contained in the resulting_yaml_test_files zip file:
- es/test.yaml (updated file)
- es/test.yaml.bak (backup file)
- es/unicode-test.yaml (updated file)
- es/unicode-test.yaml.bak (backup file)
The program has the following usage:
$ python ../../PythonScripts/audit_rule_translations.py -h
usage: audit_rule_translations.py [-h] [--mode {warnings,new_version}]
                                  [--unicode {true,false,auto}]
                                  english_rules translated_rules

Audit a translated rules file against its English version.

positional arguments:
  english_rules         The English version of the rules YAML file.
  translated_rules      The translated version of the rules YAML file.

options:
  -h, --help            show this help message and exit
  --mode {warnings,new_version}
                        In 'warnings' mode (default), differences between
                        files are listed as warnings. In 'new_version' mode, a
                        new version of the translated file is created with
                        comments where translation is needed.
  --unicode {true,false,auto}
                        Use 'true' to force handling the file as a unicode
                        definitions yaml file. Use 'false' to force handling
                        the file as a non-unicode definitions yaml file. Use
                        'auto' mode (default) to automatically detect if the
                        file is a unicode definitions yaml file by inspecting
                        the filename for 'unicode'.
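For readers who want to see how such an interface maps onto argparse, here is a sketch reconstructed only from the help text above. It is not the script's actual source; the helper is_unicode_file and its substring test on the file name are assumptions about how 'auto' detection might work:

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="audit_rule_translations.py",
        description="Audit a translated rules file against its English version.",
    )
    parser.add_argument("english_rules", help="The English version of the rules YAML file.")
    parser.add_argument("translated_rules", help="The translated version of the rules YAML file.")
    parser.add_argument(
        "--mode",
        choices=["warnings", "new_version"],
        default="warnings",
        help="List differences as warnings, or write a new commented version.",
    )
    parser.add_argument(
        "--unicode",
        choices=["true", "false", "auto"],
        default="auto",
        help="Force unicode handling on or off, or detect it from the filename.",
    )
    return parser


def is_unicode_file(args: argparse.Namespace) -> bool:
    """Resolve the --unicode option; 'auto' looks for 'unicode' in the file name."""
    if args.unicode == "auto":
        return "unicode" in Path(args.translated_rules).name
    return args.unicode == "true"


args = build_parser().parse_args(["en/unicode-test.yaml", "es/unicode-test.yaml"])
print(args.mode, is_unicode_file(args))
```

Making the mode a parameter with a default leaves the choice to the person running the tool, as suggested earlier in the thread.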
Output for --mode warnings (files are not updated) is:
$ python ../../PythonScripts/audit_rule_translations.py en/test.yaml es/test.yaml
Processing 10 items in Non-Unicode mode.
Rule 'into-or-out-of:msubsup' still contains 3 key(s) needing translating.
Rule 'into-or-out-of:mfrac' is missing in the translated file.
Rule 'into-or-out-of:mover' is missing in the translated file.
Rule 'into-or-out-of:mmultiscripts' is missing in the translated file.
Warning: Rule 'into-or-out-of:msup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: match don't match at path: into-or-out-of:msup['match']:
- 1: $Move2D != ''
- 2: $Move3D != ''
Warning: Rule 'into-or-out-of:msubsup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: if don't match at path: into-or-out-of:msubsup['replace'][1]['test'][0]['if']:
- 1: count($Child2D/preceding-sibling::*)=0
- 2: count($Child2D/following-sibling::*)=0
and
$ python ../../PythonScripts/audit_rule_translations.py en/unicode-test.yaml es/unicode-test.yaml
Processing 5 items in Unicode mode.
Rule 'B-Z' still contains 1 key(s) needing translating.
Rule '0-9' still contains 1 key(s) needing translating.
Rule '!' (Unicode char: \u0021) still contains 1 key(s) needing translating.
Rule 'a' (Unicode char: \u0061) is missing in the translated file.
Rule '(' (Unicode char: \u0028) is missing in the translated file.
Warning: Rule 'B-Z' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: value don't match at path: B-Z['B-Z'][2]['pitch']['value']:
- 1: $CapitalLetters_Pitches
- 2: $CapitalLetters_Pitch
Warning: Rule '!' (Unicode char: \u0021) contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Dictionaries don't have the same keys at path: !['!'][0]
- Keys in first dictionary that are not in second dictionary: {'test'}
Output for --mode new_version (translation file is updated with comments if necessary) is:
$ python ../../PythonScripts/audit_rule_translations.py --mode new_version en/test.yaml es/test.yaml
Processing 10 items in Non-Unicode mode.
Rule 'into-or-out-of:msubsup' still contains 3 key(s) needing translating.
Rule 'into-or-out-of:mfrac' is missing in the translated file.
Rule 'into-or-out-of:mover' is missing in the translated file.
Rule 'into-or-out-of:mmultiscripts' is missing in the translated file.
Warning: Rule 'into-or-out-of:msup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: match don't match at path: into-or-out-of:msup['match']:
- 1: $Move2D != ''
- 2: $Move3D != ''
Warning: Rule 'into-or-out-of:msubsup' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: if don't match at path: into-or-out-of:msubsup['replace'][1]['test'][0]['if']:
- 1: count($Child2D/preceding-sibling::*)=0
- 2: count($Child2D/following-sibling::*)=0
Creating new version of translated file with comments where translation is needed.
Missing rules:
into-or-out-of:mfrac is missing after None
into-or-out-of:mover is missing after into-or-out-of:munder
into-or-out-of:mmultiscripts is missing after into-or-out-of:munderover
New version of es/test.yaml created. Original backed up to es/test.yaml.bak.
3 new rule(s) that need translation.
1 rule(s) that need translation of keys.
2 rule(s) with differences other than translation.
and
$ python ../../PythonScripts/audit_rule_translations.py --mode new_version en/unicode-test.yaml es/unicode-test.yaml
Processing 5 items in Unicode mode.
Rule 'B-Z' still contains 1 key(s) needing translating.
Rule '0-9' still contains 1 key(s) needing translating.
Rule '!' (Unicode char: \u0021) still contains 1 key(s) needing translating.
Rule 'a' (Unicode char: \u0061) is missing in the translated file.
Rule '(' (Unicode char: \u0028) is missing in the translated file.
Warning: Rule 'B-Z' contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Values for key: value don't match at path: B-Z['B-Z'][2]['pitch']['value']:
- 1: $CapitalLetters_Pitches
- 2: $CapitalLetters_Pitch
Warning: Rule '!' (Unicode char: \u0021) contains differences other than ones in t, T, OC, oc, CT, ct keys:
- Dictionaries don't have the same keys at path: !['!'][0]
- Keys in first dictionary that are not in second dictionary: {'test'}
Creating new version of translated file with comments where translation is needed.
Missing rules:
a is missing after None
( is missing after B-Z
New version of es/unicode-test.yaml created. Original backed up to es/unicode-test.yaml.bak.
2 new rule(s) that need translation.
3 rule(s) that need translation of keys.
2 rule(s) with differences other than translation.
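The "is missing after ..." lines can be understood as recording where a new rule should be inserted so that file order is preserved. One way to compute that is sketched below; it anchors each missing rule to the nearest earlier English rule that does exist in the translation. Whether the script uses exactly this policy is an assumption, and the function name is hypothetical:

```python
from typing import List, Optional, Tuple


def missing_rules(
    english_keys: List[str], translated_keys: List[str]
) -> List[Tuple[str, Optional[str]]]:
    """Return (missing_key, anchor) pairs in English file order.

    anchor is the last preceding English rule that is also present in the
    translation, or None if the missing rule belongs at the top of the file.
    """
    present = set(translated_keys)
    anchor: Optional[str] = None
    missing = []
    for key in english_keys:
        if key in present:
            anchor = key
        else:
            missing.append((key, anchor))
    return missing


# Hypothetical rule order, loosely modeled on the test output above.
english_keys = ["mfrac", "munder", "mover", "munderover", "mmultiscripts"]
translated_keys = ["munder", "munderover"]

for key, anchor in missing_rules(english_keys, translated_keys):
    print(f"{key} is missing after {anchor}")
```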
If you want to inspect the python script for auditing the rule translations, it can be found here: https://github.com/brichwin/MathCAT/blob/audit_rule_translations/PythonScripts/audit_rule_translations.py
I'm going to work up a few more tests, but please let me know if it is close to what you want.