Naming Gopher's "max_non_alpha_words_ratio"

Open BramVanroy opened this issue 5 months ago • 0 comments

The Gopher filter contains this check:

# that 80 % of words in a document contain at least one alphabetic character
if (
    self.max_non_alpha_words_ratio
    and sum([any((c.isalpha() for c in w)) for w in words]) / n_words < self.max_non_alpha_words_ratio
):
    return False, "gopher_below_alpha_threshold"

Documents whose ratio falls BELOW this threshold are removed, so the parameter acts as a minimum rather than a maximum. I would therefore expect the variable to be named min_non_alpha_words_ratio, consistent with the other variable names.
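To make the direction of the comparison concrete, here is a minimal standalone sketch of the same check. The function names, the whitespace split, and the 0.8 default (taken from the "80 %" in the code comment) are assumptions for illustration only, not the library's actual API or tokenizer.

```python
def alpha_word_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)


def passes_alpha_check(text, threshold=0.8):
    # Hypothetical stand-in for the check quoted above. Splitting on
    # whitespace is an assumption; the real filter tokenizes differently.
    words = text.split()
    if not words:
        return False
    # The document is kept only when the ratio is at or above the threshold,
    # so the threshold behaves as a lower bound.
    return alpha_word_ratio(words) >= threshold


print(passes_alpha_check("the quick brown fox jumps"))  # all 5 words have letters
print(passes_alpha_check("1 2 3 4 hello"))              # only 1 of 5 words has letters
```

The first document is kept and the second is dropped, which is the behavior of a minimum threshold. The sketch also shows that the quantity being compared is the share of words that do contain an alphabetic character.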

BramVanroy, Sep 21 '24 14:09