wikipron

remove default casefolding

Open jhdeov opened this issue 2 years ago • 4 comments

Removed the statement casefold:true from the languages.json list. I rescraped hye and apw to confirm that the languages were still scraped, but now with the original case marking from Wiktionary.

  • [x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

jhdeov avatar Nov 07 '22 05:11 jhdeov

This broke a unit test:

def test_casefold_value():
        """Check if each language in data/scrape/lib/languages.json
        has a value for 'casefold' key.
        """
        missing_languages = set()
        with open(_LANGUAGES, "r") as source:
            languages = json.load(source)
        for language in languages:
>           if languages[language]["casefold"] is None:
E           KeyError: 'casefold'

project/tests/test_data/test_languages.py:18: KeyError

It seems we assume casefolding is always specified rather than giving a default. We probably should not assume this. @jacksonllee when you have a moment, what do you think about the spirit of this PR? I am in support.
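As a rough sketch of what "giving a default" could look like, the lookup can use `dict.get` so that a missing `casefold` key is treated as `False` instead of raising `KeyError`. The helper name `casefold_settings` and the sample entries `lang_a` and `lang_b` are hypothetical and only illustrate the idea; this is not the project's actual code or data.

```python
import json


def casefold_settings(languages: dict) -> dict:
    """Map each language code to its casefold flag.

    A missing "casefold" key is treated as False (assumed default),
    so entries without the key no longer raise KeyError.
    """
    return {
        code: bool(entry.get("casefold", False))
        for code, entry in languages.items()
    }


# Hypothetical excerpt in the shape the failing test expects:
# one entry sets "casefold", the other omits the key entirely.
sample = json.loads('{"lang_a": {"casefold": true}, "lang_b": {}}')
settings = casefold_settings(sample)
```

With a lookup like this, a test would check the resolved value rather than require every entry to carry the key.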

kylebgorman avatar Nov 07 '22 14:11 kylebgorman

If we're changing the casefold value for lots of languages in languages.json, does this mean that we'd have to re-scrape all these languages for consistency?

jacksonllee avatar Nov 27 '22 15:11 jacksonllee

Yes, what @jacksonllee said: we'd want to do the full scrape. (For context, I asked @jhdeov off-thread to just pilot a few of them to see how things worked before we attempt a full scrape.)

kylebgorman avatar Nov 27 '22 15:11 kylebgorman

Returning to this. My inclination is that we should make case-folding non-default and remove the casefolding annotations from languages.json on our next big scrape, and I'll probably do one this summer.

kylebgorman avatar Mar 16 '23 16:03 kylebgorman

I just implemented this in #523. Closing the PR. Thanks for the suggestion, I think you were right.

kylebgorman avatar Mar 07 '24 15:03 kylebgorman