OpusCleaner The regexes for characters in a language match on only 1 character, not the entire token

Open gregtatum opened this issue 6 months ago • 4 comments

https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/clean_common.py

The regexes for characters in a language match on only 1 character, not the entire token:

    'ca': r'[a-zÀàÈèÉéÍíÒòÓóÚúÇç]',
    'cs': r'[a-zÁáČčĎďÉéěÍíŇňÓóŘřŠšŤťÚúůÝýŽž]',
    'da': r'[a-zÆæØøÅå]',
    'de': r'[a-zÄäÖöÜüß]',

As opposed to:

    'ca': r'^[a-zÀàÈèÉéÍíÒòÓóÚúÇç]+$',
    'cs': r'^[a-zÁáČčĎďÉéěÍíŇňÓóŘřŠšŤťÚúůÝýŽž]+$',
    'da': r'^[a-zÆæØøÅå]+$',
    'de': r'^[a-zÄäÖöÜüß]+$',

So for instance "a0e1" will match as a word in this case, because `re.match` only checks that the token starts with one character from the set. I don't know if this is the intention of this filter.
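
To make the difference concrete, here is a minimal sketch using only the German character class quoted above. The variable names are made up for illustration and are not part of OpusCleaner. It relies on two facts about Python's `re` module: `re.match` anchors at the start of the string only, and an anchored character class without a quantifier accepts exactly one character.

```python
import re

# Character class for German, copied from the snippet quoted in this issue.
DE_CHARS = r'[a-zÄäÖöÜüß]'

# re.match anchors only at the start, so one leading letter is enough.
prefix_match = bool(re.match(DE_CHARS, "a0e1", re.IGNORECASE))

# A leading digit fails under re.match, although re.search still finds a letter.
leading_digit_match = bool(re.match(DE_CHARS, "0a1e", re.IGNORECASE))
leading_digit_search = bool(re.search(DE_CHARS, "0a1e", re.IGNORECASE))

# Anchors without a quantifier accept single-character tokens only.
single_char_only = bool(re.match('^' + DE_CHARS + '$', "Straße", re.IGNORECASE))

# Anchors plus "+" require every character of the token to be in the class.
whole_token = bool(re.match('^' + DE_CHARS + '+$', "Straße", re.IGNORECASE))
mixed_token = bool(re.match('^' + DE_CHARS + '+$', "a0e1", re.IGNORECASE))

print(prefix_match, leading_digit_match, leading_digit_search)
print(single_char_only, whole_token, mixed_token)
```

Under these assumptions, only the anchored pattern with `+` distinguishes "Straße" from a mixed token like "a0e1".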

For instance, here is how a regex is used:

            num_words = sum(
                [1 if re.match(CHARS[src_lang], t, re.IGNORECASE) else 0 for t in src_toks])
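
If whole-token matching is what the filter intends (which is the open question here), one possible sketch keeps the existing character classes and uses `re.fullmatch` with a `+` quantifier instead of editing every pattern. The `count_words` helper, the two-entry `CHARS` subset, and the sample sentence are hypothetical and only illustrate the idea; this is not the project's code.

```python
import re

# Subset of the per-language character classes quoted in this issue.
CHARS = {
    'da': r'[a-zÆæØøÅå]',
    'de': r'[a-zÄäÖöÜüß]',
}


def count_words(tokens, lang):
    """Count tokens made up entirely of characters from CHARS[lang].

    Hypothetical helper: fullmatch plus "+" requires the whole token to match.
    """
    pattern = re.compile(CHARS[lang] + '+', re.IGNORECASE)
    return sum(1 for t in tokens if pattern.fullmatch(t))


tokens = "Die Straße ist 0a1e a0e1 lang".split()

# Counting as in the snippet above: only the first character is checked.
current = sum(1 for t in tokens if re.match(CHARS['de'], t, re.IGNORECASE))

# Counting with whole-token matching.
strict = count_words(tokens, 'de')

print(current, strict)
```

With this sample input the existing approach counts "a0e1" as a word while the whole-token variant does not, so the two counts differ by one.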

Apr 28 '25 19:04 gregtatum