dps
dps copied to clipboard
[ja] `.filter` is used instead of `.map` for non-filter methods
On https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L64-L75 there are several cases where we are using .filter
but instead it should be a .map
.
For example https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73 calls https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/prep/japanese_prep.py#L64-L67 but in effect this is doing nothing because the expression within .filter is always is true, as long as text is non-empty:
>>> def reduce_japanese_emoticon(text):
... text = re.sub("w{3,}", "www", text)
... text = re.sub("笑{2,}", "笑", text)
... return text
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
Thus, I think the following cases of .filter
are simply doing nothing instead of the intended preprocessing:
-
preprocess_text
on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L70 -
reduce_japanese_emoticon
on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73 -
remove_symbols
on https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L75
The remaining calls to methods that end with _filter
(e.g. japanese_bad_words_filter
, doc_len_filter
, etc.) are actually filter methods that return booleans so they should be OK.