The naming in gopher_quality_filter seems to be incorrect

Open ryan-minato opened this issue 1 year ago • 0 comments

The naming here seems to be incorrect.

https://github.com/huggingface/datatrove/blob/0f2c69f8249aa0c53ebcf10afa2394da506a953f/src/datatrove/pipeline/filters/gopher_quality_filter.py#L114-L120

Based on the implementation, the parameter should probably be named `min_alpha_words_ratio` rather than `max_non_alpha_words_ratio`. If the name `max_non_alpha_words_ratio` is kept, its value should be 0.2 instead of 0.8.
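
To illustrate the mismatch, here is a minimal sketch. It assumes the check at the linked lines rejects a document when the fraction of words containing at least one alphabetic character falls below the threshold. `alpha_words_ratio` and `passes` are hypothetical names used only for this illustration, not datatrove functions.

```python
def alpha_words_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)


def passes(words, threshold=0.8):
    # Hypothetical stand-in for the linked check (an assumption, not the
    # library's code): keep the document only if the alpha-word ratio
    # reaches the threshold.
    return alpha_words_ratio(words) >= threshold


# 9 of 10 words contain a letter, so the alpha-word ratio is 0.9.
mostly_words = ["the", "quick", "brown", "fox", "jumps",
                "over", "the", "lazy", "dog", "42"]
# 2 of 10 words contain a letter, so the alpha-word ratio is 0.2.
mostly_symbols = ["the", "fox", "42", "###", "---",
                  "3.14", "!!", "0", "%%", "=="]

print(passes(mostly_words))    # kept
print(passes(mostly_symbols))  # rejected
```

Under that assumption, 0.8 acts as a lower bound on the share of alphabetic words, which is equivalent to an upper bound of 0.2 on the share of non-alphabetic words.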

BTW, using `isalpha` seems to treat Chinese and Japanese characters as letters. Is this the expected behavior?
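
A quick way to see this behavior in plain Python, independent of datatrove. `has_alpha` is a hypothetical helper mirroring the "at least one alphabetic character" test and is not taken from the library.

```python
import unicodedata


def has_alpha(word):
    """True if the word contains at least one character str.isalpha accepts."""
    return any(c.isalpha() for c in word)


samples = ["hello", "漢字", "ひらがな", "カタカナ", "123", "..."]
for s in samples:
    # str.isalpha follows Unicode letter categories, so CJK ideographs and
    # kana (category "Lo") count as alphabetic just like Latin letters.
    print(s, has_alpha(s), [unicodedata.category(c) for c in s])
```

So a document written entirely in Chinese or Japanese would count as fully alphabetic under this test, provided the upstream word splitting produces words from it.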

ryan-minato, May 30 '24 01:05