Data processing system for polyglot

Results 13 dps issues

Hello, I am getting this error while running dedup_job. I am able to run sample_job and Korean_job, but I get this error in dedup_job. I am using a conda env,...

On https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L64-L75 there are several cases where we use `.filter` where it should be `.map`. For example, https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/jobs/japanese_job.py#L73 calls https://github.com/EleutherAI/dps/blob/bec4078f341037879feab1d5c82668745b28aa55/dps/spark/prep/japanese_prep.py#L64-L67 but in effect this is doing nothing...
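To illustrate the `.filter` versus `.map` point, here is a minimal sketch using Python's built-in `filter` and `map` as stand-ins for the RDD methods, on the assumption that the RDD versions behave the same way with respect to truthiness. `strip_spaces` is a hypothetical transformation, not the function from `japanese_prep.py`.

```python
def strip_spaces(text: str) -> str:
    """Hypothetical cleaning step: returns a cleaned string, not a boolean."""
    return " ".join(text.split())


docs = ["a  b", "  c  ", "d"]

# Used as a predicate: every non-empty return value is truthy, so each
# record is kept and the original, uncleaned text passes through unchanged.
filtered = list(filter(strip_spaces, docs))

# Used as a transformation: the cleaned text replaces the original.
mapped = list(map(strip_spaces, docs))

print(filtered)
print(mapped)
```

With `filter`, the cleaning function only decides whether a record survives, so its return value is discarded; with `map`, the return value becomes the new record.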

Also, I found that the `_remove_repeated_phrase` function is wrong. It should be fixed later.

There is a memory error when deduplicating Chinese data. ``` 23/04/19 19:44:17 WARN MemoryStore: Not enough space to cache rdd_7_0 in memory! (computed 176.2 MiB so far) 23/04/19 19:44:17 WARN BlockManager:...

# Background - It seems that we don't have to implement much pre-processing to replace Japanese PII - because some PII pre-processing already exists in the language-agnostic processing....

# Background - Similar background to https://github.com/EleutherAI/dps/issues/50 - We might need to implement `japanese_spam_words_filter` on an as-needed basis.

### Agenda - Some raw datasets can have empty or null text. - Using the `filter` method on an RDD or DF, text like `""` needs to be ignored during this process....
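A minimal sketch of such a predicate, written in plain Python so it runs without Spark. The record layout (a dict with a `"text"` key), the name `has_text`, and the choice to treat whitespace-only text as empty are all assumptions for illustration, not taken from the dps code.

```python
from typing import Optional


def has_text(record: dict) -> bool:
    """Return True only when the record carries non-empty text.

    Assumption: records are dicts with an optional "text" key, and
    whitespace-only strings count as empty.
    """
    text: Optional[str] = record.get("text")
    return isinstance(text, str) and text.strip() != ""


rows = [{"text": "hello"}, {"text": ""}, {"text": None}, {}, {"text": "   "}]

# Keep only records with usable text; empty, null, and missing text are dropped.
kept = [row for row in rows if has_text(row)]
print(kept)
```

Because the function returns a real boolean, the same predicate could in principle be passed to an RDD's `filter` method in place of the list comprehension.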

bug