Duplicate Keys in dictionaries ("sticks" / "statistics")

Open JRJurman opened this issue 5 years ago • 12 comments

Summary

When looking at the project files in VS Code, I noticed that it had highlighted a problem in top-10000-project-gutenberg-words.json. The problem was a duplicate key in JSON, which, while technically valid, is probably not very useful in Plover's case. Most JSON parsers (Plover's included, I'm guessing) will ignore the first entry and keep only the second (and indeed, if I filter for "sticks" in the Plover dictionary editor, it does not show up).
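As a quick illustration of the "last entry wins" behaviour described above, here is a minimal sketch using Python's standard json module. Plover itself is written in Python, but whether its dictionary loader behaves exactly this way is an assumption here, and the two-entry string is an illustrative excerpt rather than the real file contents.

```python
import json

# Illustrative excerpt: the same outline appears twice with different translations.
raw = '{"STEUBGS": "sticks", "STEUBGS": "statistics"}'

# The standard parser accepts the input without complaint and keeps the later value.
parsed = json.loads(raw)
print(parsed)  # {'STEUBGS': 'statistics'}
```

The "sticks" entry is dropped silently, which matches what the dictionary editor showed.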

Potential Solution

I'm not entirely sure how the chords are made for these words, but this feels like something that just needs to be updated, potentially by changing "sticks" to STEUBG/-S.

Futureproofing

There is a node module, find-duplicated-property-keys, that takes in a dictionary and prints out any duplicated keys. I ran it using the following script:

npm i -g find-duplicated-property-keys
for dictionary in dictionaries/*.json; do find-duplicated-property-keys -s "$dictionary"; done > duplicates_log.txt

I install find-duplicated-property-keys globally, which requires Node on the machine. (Technically I could use npx, but the command is already kind of slow on these larger files, and doing an install for every file is overkill.)

I then run a bash for loop that runs the command on every dictionary in the dictionaries/ folder. The output is redirected to duplicates_log.txt, but this part could be removed to just show the output on the command line. It looks something like this:

The following duplicated property keys have been detected in dictionaries/top-10000-project-gutenberg-words.json:
<instance>.STEUBGS
No duplicated property keys found in dictionaries/top-1000-words.json.
No duplicated property keys found in dictionaries/top-100-words.json.
No duplicated property keys found in dictionaries/top-200-words-spoken-on-tv.json.

And I got a lot of duplicated key warnings, in 12 different files.

bad-habits.json, code.json, condensed-strokes.json, currency.json, dict-en-AU-vocab.json, javascript.json, medical-suffixes.json, nouns.json, punctuation-di.json, and top-10000-project-gutenberg-words.json all had a small handful.

modifiers.json has around 2000, numbers.json has around 70.

I've uploaded the output here: https://gist.github.com/JRJurman/ba259871c67f7e086fac01797a72f11a
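For anyone without Node installed, a similar check can be sketched in Python with the standard json module's object_pairs_hook. This is not how find-duplicated-property-keys works internally; the function name find_duplicate_keys and the sample string are made up for illustration, and the sketch assumes the dictionaries are flat objects of outline-to-translation pairs.

```python
import json
from collections import defaultdict


def find_duplicate_keys(text):
    """Map each key that appears more than once to all of its values, in file order."""
    seen = defaultdict(list)

    def collect(pairs):
        # The parser hands over every (key, value) pair, duplicates included.
        for key, value in pairs:
            seen[key].append(value)
        return dict(pairs)

    json.loads(text, object_pairs_hook=collect)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Illustrative excerpt, not the real file contents.
sample = '{"STEUBGS": "sticks", "TEFT": "test", "STEUBGS": "statistics"}'
duplicates = find_duplicate_keys(sample)
print(duplicates)  # {'STEUBGS': ['sticks', 'statistics']}
```

Listing the conflicting values next to each key makes it easier to decide which translation should keep the outline.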

JRJurman, Feb 23 '20 18:02

Nice work! For the "statistics" issue, it looks like dict.json is missing the ST*BGS stroke for "statistics", which is in the Plover dictionaries. Given that Plover says STEUBGS is for "sticks", I think you could submit a PR that does the following:

  • Adds a "STEUBGS": "sticks" entry in dict.json
  • Changes "STEUBGS": "statistics" entry in dict.json to "ST*BGS": "statistics"
  • Changes "STEUBGS": "statistics" entry in top-10000-project-gutenberg-words.json to "ST*BGS": "statistics"
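The re-keying step in the list above could be sketched like this. The helper name rekey is hypothetical and the input is an illustrative two-entry excerpt, not the real file. Parsing into a list of pairs (instead of a dict) keeps both duplicate entries visible until the conflict is resolved.

```python
import json


def rekey(pairs, old_key, value, new_key):
    """Return the pairs with the (old_key, value) entry moved to new_key."""
    return [
        (new_key if (key, val) == (old_key, value) else key, val)
        for key, val in pairs
    ]


# Illustrative excerpt with the conflicting outline.
raw = '{"STEUBGS": "sticks", "STEUBGS": "statistics"}'

# object_pairs_hook=list preserves every pair, including duplicates.
pairs = json.loads(raw, object_pairs_hook=list)
fixed = rekey(pairs, "STEUBGS", "statistics", "ST*BGS")

# With the conflict resolved, converting to a dict no longer loses an entry.
print(json.dumps(dict(fixed)))  # {"STEUBGS": "sticks", "ST*BGS": "statistics"}
```

In practice a hand edit of the two files is probably simpler; the sketch is mainly useful if many entries need the same treatment.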

paulfioravanti, Feb 23 '20 20:02

I can make a PR with those changes later today 👍

It didn't even occur to me to look at the original Plover dictionary for a resolution ✨. Do you want me to investigate the other conflicts? I can at least give them a cursory look to see if there are other easy resolutions, although I feel like numbers and modifiers will be harder to tackle.

JRJurman, Feb 23 '20 21:02

I can make those separate PRs too, since I wouldn't want to hold up this change.

JRJurman, Feb 23 '20 21:02

Do you want me to investigate the other conflicts? I can at least give them a cursory look to see if there are other easy resolutions

I'd say go for it! Anything that makes the dictionaries better for all of us steno learners is a win!

paulfioravanti, Feb 23 '20 22:02

Thanks for putting this together @JRJurman 👏

Yes, while JSON permits duplicate keys, it's not ideal in practical use. I believe Plover will see every entry, but overwrite previous entries when it finds an outline that already exists, so it's ok to have globally duplicate keys across dictionaries, so long as you know what order to keep your dictionaries in. It's more of an issue to have duplicates within dictionaries as we have here.
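To make the point about dictionary order concrete, here is a simplified model of "later entries overwrite earlier ones" across several dictionaries. It is not Plover's actual lookup code, and which end of Plover's dictionary list takes priority is not something this sketch settles; merge_in_order and the sample entries are made up for illustration.

```python
def merge_in_order(dictionaries):
    """Merge dictionaries so that entries from later ones overwrite earlier ones."""
    merged = {}
    for entries in dictionaries:
        merged.update(entries)
    return merged


# Two hypothetical dictionaries that disagree about one outline.
base = {"STEUBGS": "statistics", "TEFT": "test"}
override = {"STEUBGS": "sticks"}

print(merge_in_order([base, override]))  # {'STEUBGS': 'sticks', 'TEFT': 'test'}
print(merge_in_order([override, base]))  # {'STEUBGS': 'statistics', 'TEFT': 'test'}
```

Across dictionaries the outcome is controlled by the order you choose. Within a single file there is no such control, which is why in-file duplicates are the bigger problem.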

For the other dictionaries with duplicates, it will be handy to have separate issues and PRs for those to discuss how to resolve some of the duplicates and ship them one after another.

bad-habits.json could possibly continue to have duplicates. It's an accumulation of bad entries from different places, so one key with two values might have both values wrong and both worth marking as bad habits. Leaving them in doesn't deal with the ambiguity of the keys, but it's not super important to fix.

condensed-strokes.json would be great to fix soon. I've just pushed a branch for fixing the duplicates in numbers.json.

modifiers.json would be worth regenerating from the original script that built it the first time. I'd have to dig up where that came from so it can be updated and re-run.

Thanks @paulfioravanti for outlining the resolution to "sticks" and "statistics". These look good!

didoesdigital, Feb 24 '20 07:02

On futureproofing, it might be nice to set up Travis CI to highlight introduced duplicates on PRs to prevent regressions.
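One possible shape for such a check is a small Python script that a CI job runs against the dictionaries/ folder. The Travis configuration itself is not shown, and the function names and report format below are illustrative assumptions rather than anything already in the repo.

```python
import glob
import json


def duplicate_keys(path):
    """Return the keys that occur more than once in the JSON object stored at path."""
    dupes = []

    def check(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                dupes.append(key)
            seen.add(key)
        return dict(pairs)

    with open(path, encoding="utf-8") as handle:
        json.load(handle, object_pairs_hook=check)
    return dupes


def main(pattern="dictionaries/*.json"):
    """Print a line per offending file and return 1 if any duplicates were found."""
    failed = False
    for path in sorted(glob.glob(pattern)):
        dupes = duplicate_keys(path)
        if dupes:
            failed = True
            print(f"{path}: {len(dupes)} duplicated key(s), e.g. {dupes[0]}")
    return 1 if failed else 0
```

A CI step could end the script with `raise SystemExit(main())` so that a non-zero status fails the build whenever a PR introduces a duplicate.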

didoesdigital, Feb 24 '20 07:02

As a reminder to myself, here's the link to a convenient online tool for validating JSON that also highlights duplicates: Miscue-js -- JSON validation.

didoesdigital, Feb 24 '20 07:02

If there are scripts to generate some of these .json files, it might be worthwhile keeping them in this repository (under a scripts folder or something?). This would be useful if we wanted to include a duplicates checker or other dictionary generators in the project.

That being said, I do love that for the most part this is just a bunch of json files and it doesn't require reading or installing anything to get the dictionaries.

JRJurman, Feb 24 '20 15:02

I would tend to agree with the latter part of your comment. Just keep this repo as it says on the tin: steno-dictionaries.

Unless @didoesdigital wants to continue to maintain and evolve the scripts under her account somewhere, perhaps in a separate repo, I'd say they'd have value as one of your personal projects, @JRJurman.

paulfioravanti, Feb 25 '20 01:02

Between https://github.com/didoesdigital/typey-type/issues/6 and https://github.com/didoesdigital/typey-type-data/issues/1, I hope to make the private static lesson generator redundant and therefore have fewer scripts using the json files. Most of the 'generated' .json files are stored in https://github.com/didoesdigital/typey-type-data/ at the moment. I'm still on the fence about where the bulk of the logic should live for checking the quality of dictionaries and lessons, and what should be generated and stored vs figured out on the fly (e.g. specific dictionaries could be built in app from lessons). For now, I think I want to keep this repo fairly basic.

didoesdigital, Feb 25 '20 22:02

Hey @didoesdigital, would you mind a PR that removes duplicated entries from modifiers.json? I just ran jq over the file, because it was hard to figure out what's going on when looking up some modifier strokes with grep.

timon, Oct 23 '22 13:10

Sure! Let’s have a look at a PR and see what’s going on there.

didoesdigital, Oct 23 '22 23:10