datatrove
Naming Gopher's "max_non_alpha_words_ratio"
In the Gopher filter, there's this check:
    # that 80 % of words in a document contain at least one alphabetic character
    if (
        self.max_non_alpha_words_ratio
        and sum([any((c.isalpha() for c in w)) for w in words]) / n_words < self.max_non_alpha_words_ratio
    ):
        return False, "gopher_below_alpha_threshold"
Given that all documents with a LOWER ratio than this value are removed, the value acts as a lower bound, so I would expect the variable to be named min_non_alpha_words_ratio, consistent with the min_/max_ naming of the other variables.
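To illustrate the point, here is a minimal standalone sketch of the same comparison. The function names `alpha_words_ratio` and `passes_alpha_check` and the default of 0.8 are my own, chosen for illustration; they are not part of datatrove. It assumes a non-empty list of words.

```python
def alpha_words_ratio(words):
    """Fraction of words containing at least one alphabetic character.

    Assumes `words` is non-empty (hypothetical helper, not datatrove API).
    """
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)


def passes_alpha_check(words, threshold=0.8):
    # Mirrors the quoted comparison: a document is rejected when its
    # ratio falls BELOW the threshold, so the threshold is a minimum.
    return not (threshold and alpha_words_ratio(words) < threshold)


mostly_text = "the quick brown fox 42".split()   # 4 of 5 words contain a letter
mostly_numbers = "12 34 56 fox 78".split()       # 1 of 5 words contains a letter

print(passes_alpha_check(mostly_text))     # kept: ratio is not below 0.8
print(passes_alpha_check(mostly_numbers))  # removed: ratio is below 0.8
```

The document with the lower ratio is the one that gets removed, which is why a "max" prefix reads as backwards here.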