datatrove
The naming in gopher_quality_filter seems to be incorrect
The variable name used in the check linked below seems to be incorrect.
https://github.com/huggingface/datatrove/blob/0f2c69f8249aa0c53ebcf10afa2394da506a953f/src/datatrove/pipeline/filters/gopher_quality_filter.py#L114-L120
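Based on the implementation, the threshold acts as a lower bound on the share of words that contain alphabetic characters, so the variable should likely be named min_alpha_words_ratio instead of max_non_alpha_words_ratio. If the name max_non_alpha_words_ratio is kept, the value should be 0.2 instead of 0.8.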
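To make the mismatch concrete, here is a minimal sketch of the two consistent ways to express the check. This is not the datatrove code: the helper names and the example word lists are hypothetical, and it assumes the filter rejects a document when the share of words with at least one alphabetic character falls below the threshold.

```python
# Hypothetical helpers for illustration only; not part of datatrove.

def alpha_word_count(words):
    """Count words that contain at least one alphabetic character."""
    return sum(any(c.isalpha() for c in w) for w in words)


def passes_min_alpha(words, min_alpha_words_ratio=0.8):
    """Keep the document if enough words contain a letter (lower bound)."""
    return alpha_word_count(words) / len(words) >= min_alpha_words_ratio


def passes_max_non_alpha(words, max_non_alpha_words_ratio=0.2):
    """Keep the document if few enough words lack a letter (upper bound)."""
    non_alpha = len(words) - alpha_word_count(words)
    return non_alpha / len(words) <= max_non_alpha_words_ratio


mostly_text = ["the", "quick", "brown", "fox", "jumps", "42"]
mostly_numbers = ["12", "34", "56", "ok"]

for words in (mostly_text, mostly_numbers):
    print(words, passes_min_alpha(words), passes_max_non_alpha(words))
```

Under these assumptions, a 0.8 threshold only makes sense with the "min alpha" name, and the "max non-alpha" name only makes sense with 0.2 and a flipped comparison.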
BTW, using isalpha seems to treat Chinese and Japanese characters as letters. Is this the expected behavior?
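A quick way to see this behavior is the small sketch below. The `has_alpha` helper is hypothetical and only mirrors the "at least one alphabetic character" idea; the sample strings are made up for illustration.

```python
# Hypothetical helper for illustration only; not part of datatrove.

def has_alpha(word):
    """Return True if any character in the word is alphabetic per str.isalpha."""
    return any(c.isalpha() for c in word)


samples = ["hello", "中文", "日本語", "ひらがな", "カタカナ", "12345", "。、"]

for s in samples:
    # str.isalpha follows the Unicode "Letter" categories, not just A-Z.
    print(repr(s), s.isalpha(), has_alpha(s))
```

In Python 3, `str.isalpha` is defined in terms of the Unicode letter categories, so the Chinese and Japanese samples count as alphabetic, while the digits and CJK punctuation do not.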