OpusCleaner Corpus error detections

Corpus error detections

Open XapaJIaMnu opened this issue 2 years ago • 2 comments

Some corpora are contaminated with XML, HTML, CSV and other types of artefacts.

We should have a category of filters that just runs the equivalent of `grep -n` over the full dataset for these artefacts and checks whether the corruption is actually present. The front-end output should be just true/false at this point. If true, it should maybe also output the matching line numbers.
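A minimal sketch of what such a detector could look like, in Python. Everything here is an assumption for illustration: the names `ARTEFACT_PATTERNS`, `Detection` and `detect` are hypothetical and not part of OpusCleaner, and the regular expressions are rough heuristics rather than a vetted list of artefacts.

```python
import re
from typing import Iterable, List, NamedTuple

# Hypothetical heuristics; the real set of patterns would need tuning per corpus.
ARTEFACT_PATTERNS = {
    # Something that looks like an opening, closing or self-closing tag.
    "xml_html_tag": re.compile(r"</?[A-Za-z][A-Za-z0-9:-]*(\s[^<>]*)?/?>"),
    # Named or numeric character references such as &amp; or &#160;.
    "html_entity": re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);"),
    # Two adjacent double-quoted fields separated by a comma.
    "csv_quoted_fields": re.compile(r'"[^"\n]*","[^"\n]*"'),
}


class Detection(NamedTuple):
    found: bool              # the true/false answer for the front-end
    line_numbers: List[int]  # 1-based line numbers, like `grep -n`


def detect(lines: Iterable[str], pattern: "re.Pattern") -> Detection:
    """Scan every line and report which ones match the pattern."""
    hits = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]
    return Detection(bool(hits), hits)


if __name__ == "__main__":
    sample = [
        "Hello world",
        "<p>Hello</p>",
        "Tom &amp; Jerry",
        '"a","b","c"',
    ]
    for name, pattern in ARTEFACT_PATTERNS.items():
        print(name, detect(sample, pattern))
```

Because `detect` takes any iterable of lines, it can be fed an open file handle and will stream through a large corpus without loading it into memory. The detector only reports; it does not modify or drop anything.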

Jun 30 '22 14:06 XapaJIaMnu
