OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
In the data filter tab, it would be nice to normalize the language tags to something like the BCP 47 language tag standard.
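As a rough illustration of what such normalization could look like, here is a minimal sketch that applies only the casing and separator conventions recommended by BCP 47 (lowercase language, title-case script, uppercase region, hyphen separators). The function name `normalize_tag` is hypothetical and not part of OpusCleaner; the sketch does no validation against the IANA subtag registry.

```python
def normalize_tag(tag: str) -> str:
    """Apply BCP 47 casing conventions to a language tag (formatting only)."""
    # Accept the common underscore form ("en_US") as well as hyphens.
    subtags = tag.strip().replace("_", "-").split("-")
    normalized = []
    after_singleton = False  # subtags after a singleton (e.g. "x", "u") stay lowercase
    for index, subtag in enumerate(subtags):
        if index == 0 or after_singleton:
            normalized.append(subtag.lower())    # primary language, extensions
        elif len(subtag) == 4 and subtag.isalpha():
            normalized.append(subtag.title())    # script: "hant" -> "Hant"
        elif len(subtag) == 2 and subtag.isalpha():
            normalized.append(subtag.upper())    # region: "tw" -> "TW"
        else:
            normalized.append(subtag.lower())    # variants, numeric regions, extlang
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(normalized)
```

For example, `normalize_tag("EN_us")` gives `en-US` and `normalize_tag("zh-hant-tw")` gives `zh-Hant-TW`. Mapping legacy or three-letter codes to their preferred BCP 47 form would need a registry lookup, which this sketch leaves out.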
In hindsight, OpusFilter had the right idea here. OpusCleaner right now has filters, which take lines on their stdin and produce lines on their stdout. This model is really simple,...
When the dataset contains consistent but easy-to-clean noise (e.g. a space at the end of every line), running a filter that removes the space will render the whole diff trivial...
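To make the lines-in, lines-out filter model concrete, here is a minimal sketch of a filter that removes trailing spaces. The names `strip_trailing_space` and `run_filter` are hypothetical, and the sketch does not reproduce how OpusCleaner declares or registers its filters; it only assumes that a filter reads lines from one text stream and writes lines to another.

```python
import sys
from typing import Iterable, Iterator, TextIO


def strip_trailing_space(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with trailing spaces removed, keeping one newline."""
    for line in lines:
        # Drop the line terminator first, then the trailing spaces before it.
        yield line.rstrip("\r\n").rstrip(" ") + "\n"


def run_filter(source: TextIO, sink: TextIO) -> None:
    """Stream lines from source to sink through the cleaning step."""
    for cleaned in strip_trailing_space(source):
        sink.write(cleaned)


# As a standalone script this would be wired up as:
#     run_filter(sys.stdin, sys.stdout)
```

Only spaces are stripped, not tabs, on the assumption that a tab may separate columns of a parallel corpus and an empty last column should survive.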
We need a license. The license should generally be open, but include a clause that prevents Mozilla from using this software unless they: 1. Clearly acknowledge the source of the...
http://127.0.0.1:8000/frontend/index.html#/datasets/ECB-v1.en-mt/configuration is making my browser slow.
Merge in Lukas' changes: https://github.com/hplt-project/OpusCleaner/compare/main...lukasweymann:empty-train:main
I think the data isn't shuffled, or at least ECB has consecutive sentences. Shouldn't I be looking at a random representative sample of the data?
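One way to get a uniformly random sample without shuffling or loading the whole corpus is reservoir sampling. The sketch below is an illustration of that technique (Algorithm R), not a description of how OpusCleaner currently builds its samples; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in a single pass."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = item
    return sample
```

Because the input is consumed once and only `k` items are held in memory, this works on a streamed corpus file. The sampled lines come back in reservoir order rather than corpus order, so consecutive sentences are no longer grouped.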
Asking where the elephant in the room is.
Let me enter free-form text notes on a corpus.