OpusCleaner
Corpus error detections
Some corpora are contaminated with XML/HTML/CSV and other types of artefacts.
We should have a category of filters that just runs grep -n over the full dataset and checks whether this corruption is actually present. At this point the front-end output should be just true/false. If true, it could perhaps also output the matching line numbers.
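
To make the idea concrete, here is a minimal sketch in Python of what such a detection pass could look like. It is not existing OpusCleaner code: the function name detect_artefacts, the ARTEFACT_PATTERNS table and the regular expressions themselves are all assumptions chosen for illustration. It reports true/false plus 1-based line numbers, similar to what grep -n would print.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical patterns, for illustration only. Which artefacts are worth
# detecting (and how strictly) is an open question in this issue.
ARTEFACT_PATTERNS = {
    # Something that looks like an opening or closing XML/HTML tag.
    "xml_html_tag": re.compile(r"</?[A-Za-z][A-Za-z0-9:-]*(\s[^<>]*)?/?>"),
    # Named or numeric character entities such as &amp; or &#160;
    "html_entity": re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);"),
    # Two or more consecutive quoted, comma-terminated fields.
    "csv_like": re.compile(r'(?:"[^"]*",){2,}'),
}


def detect_artefacts(lines: Iterable[str]) -> Tuple[bool, List[Tuple[int, str]]]:
    """Return (found, hits), where hits is a list of (line number, pattern name).

    Line numbers are 1-based, like the output of grep -n. Only the first
    matching pattern is recorded for each line.
    """
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in ARTEFACT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
                break
    return bool(hits), hits
```

For a real corpus the lines could come from an open file handle, e.g. detect_artefacts(open(path, encoding="utf-8")), so the dataset is streamed rather than loaded into memory. The boolean is what the front-end would show, and the list of hits covers the optional line-number output.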