wikipron
remove default casefolding
Removed the statement `casefold: true` from the `languages.json` list. I rescraped hye and apw to confirm that the languages were still scraped, but now with the original case marking from Wiktionary.
- [x] Updated `Unreleased` in `CHANGELOG.md` to reflect the changes in code or data.
This broke a unit test:
        def test_casefold_value():
            """Check if each language in data/scrape/lib/languages.json
            has a value for 'casefold' key.
            """
            missing_languages = set()
            with open(_LANGUAGES, "r") as source:
                languages = json.load(source)
            for language in languages:
    >           if languages[language]["casefold"] is None:
    E           KeyError: 'casefold'

    project/tests/test_data/test_languages.py:18: KeyError
It seems we assume casefolding is always specified rather than giving a default. We probably should not assume this. @jacksonllee when you have a moment, what do you think about the spirit of this PR? I am in support.
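One way to avoid assuming the key is present is to read it with a default. The following is only a minimal sketch, not the project's actual code: the helper name `casefold_setting`, the `name` field, and the `xyz` entry are hypothetical, and the sample data merely imitates the shape of `languages.json`.

```python
import json


def casefold_setting(languages, code, default=False):
    """Return the 'casefold' flag for a language, or `default` if the key is absent.

    Hypothetical helper for illustration; not part of the project.
    """
    return bool(languages[code].get("casefold", default))


# Hypothetical excerpt in the shape of languages.json:
# "hye" has no "casefold" key, "xyz" (a made-up entry) sets it explicitly.
sample = json.loads(
    '{"hye": {"name": "Armenian"},'
    ' "xyz": {"name": "Example", "casefold": true}}'
)

hye_folds = casefold_setting(sample, "hye")  # key absent, falls back to default
xyz_folds = casefold_setting(sample, "xyz")  # key present, explicit value used
```

With a lookup like this, a language entry without `casefold` no longer raises `KeyError`; it simply keeps the original case.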
If we're changing the casefold value for lots of languages in languages.json, does this mean that we'd have to re-scrape all these languages for consistency?
Yes, what @jacksonllee said: we'd want to do the full scrape. (For context, I asked @jhdeov off-thread to just pilot a few of them to see how things worked before we attempt a full scrape.)
Coming back to this: my inclination is that we should make case-folding non-default and remove the casefolding annotations from `languages.json` on our next big scrape, and I'll probably do one this summer.
I just implemented this in #523. Closing the PR. Thanks for the suggestion, I think you were right.