dps icon indicating copy to clipboard operation
dps copied to clipboard

[ja] `.filter` is used instead of `.map` for non-filter methods

Open mrorii opened this issue 1 year ago • 1 comments

On https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L64-L75 there are several cases where we are using .filter but instead it should be a .map.

For example https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73 calls https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/prep/japanese_prep.py#L64-L67 but in effect this is doing nothing because the expression within .filter is always is true, as long as text is non-empty:

>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]

Thus, I think the following cases of .filter are simply doing nothing instead of the intended preprocessing:

  • preprocess_text on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L70
  • reduce_japanese_emoticon on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73
  • remove_symbols on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L75

The remaining calls to methods that end with _filter (e.g. japanese_bad_words_filter, doc_len_filter, etc.) are actually filter methods that return booleans so they should be OK.

mrorii avatar Jul 14 '23 14:07 mrorii