OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
Experimental branch that uses OpusFilter for all of the tools and processing. Ideally we'd add compatibility for our external filters (the .json files) as well because I like that extensibility...
Many filters need to know the languages of the two columns. It could be useful if those can be filled in automatically. The interface knows the language of the two...
In the category _lessons from Paracrawl_: It is sometimes very useful to split the actual filtering pipeline into a couple of steps that are then executed on different hardware. For...
https://github.com/jelmervdl/empty-train/blob/5edf6e20b7fe381ca87d4abe9aa4fcf1985b63ef/main.py#L31 We should optionally support a list of possible values, or something like a datatype hint (e.g. int).
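One way the "list of possible values or datatype hint" idea could look is sketched below. This is not the code at the linked line; `ParameterSpec`, `dtype`, `choices` and the example parameter names are hypothetical and only illustrate the shape of such a hint.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class ParameterSpec:
    """Hypothetical description of a single configurable parameter."""
    name: str
    dtype: Callable[[str], Any] = str        # datatype hint, e.g. int
    choices: Optional[Sequence[Any]] = None  # optional list of possible values

    def parse(self, raw: str) -> Any:
        """Convert a raw string to the hinted type and check it against choices."""
        try:
            value = self.dtype(raw)
        except ValueError as err:
            raise ValueError(f"{self.name}: cannot convert {raw!r}") from err
        if self.choices is not None and value not in self.choices:
            raise ValueError(
                f"{self.name}: {value!r} is not one of {list(self.choices)}"
            )
        return value


# Hypothetical parameters, for illustration only
threshold = ParameterSpec("threshold", dtype=int)
mode = ParameterSpec("mode", choices=["strict", "lenient"])
```

With a spec like this, `threshold.parse("5")` yields an integer, while `mode.parse("other")` raises a `ValueError` that names the allowed values, which is the kind of early feedback an interface could surface to the user.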
The last time I worked with this, I was using [OpenCC](https://pypi.org/project/OpenCC/). It is much more up to date and seems to have an active community. Last release from hanziconv is...
Some corpora are contaminated with XML/HTML/CSV and other types of artefacts. We should have a category of filters that just runs `grep -n` over the full dataset for these, and...
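A minimal sketch of such a `grep -n` style scan, written in Python so it could sit next to the other filters. The two patterns and the names `ARTEFACT_PATTERNS` and `scan` are assumptions chosen for illustration; they are not an existing OpusCleaner filter, and a real category would need a broader, tuned pattern set.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed patterns for illustration; extend or replace as needed.
ARTEFACT_PATTERNS = {
    "xml/html tag": re.compile(r"</?[A-Za-z][^<>]*>"),
    "html entity": re.compile(r"&(?:[A-Za-z]+|#\d+);"),
}


def scan(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, pattern label, line) for each suspicious line.

    Line numbers start at 1, like `grep -n`. Only the first matching
    pattern is reported per line.
    """
    for number, line in enumerate(lines, start=1):
        for label, pattern in ARTEFACT_PATTERNS.items():
            if pattern.search(line):
                yield number, label, line.rstrip("\n")
                break


sample = [
    "Hello world\tHallo wereld\n",
    "<p>Hi</p>\tHoi\n",
    "Tom &amp; Jerry\tTom en Jerry\n",
]
hits = list(scan(sample))
```

Because `scan` accepts any iterable of lines, it can be pointed at an open file handle for a full dataset without loading it into memory.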
In PR #157 I added additional alphabet support. This information is provided by professional translators in the CLDR data: https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/sl/characters.json It would be nice to consume that JSON and automate...
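A rough sketch of what consuming that JSON might involve. It assumes the file nests the data as `main` → locale → `characters` → `exemplarCharacters`, with the value written as a bracketed, space-separated set; the inline `SAMPLE` is a hand-written stand-in rather than a copy of the linked file, and `exemplar_characters` is a hypothetical helper. Ranges such as `a-z` and escape sequences are not handled here.

```python
import json
import re
from typing import List

# Hand-written stand-in for a CLDR characters.json file (assumed layout).
SAMPLE = json.loads("""
{"main": {"sl": {"characters": {
  "exemplarCharacters": "[a b c č d e f g h i j k l m n o p r s š t u v z ž]"
}}}}
""")


def exemplar_characters(data: dict, locale: str) -> List[str]:
    """Return the exemplar characters for a locale as a list of strings.

    Handles single characters and {multi-character} sequences only.
    """
    raw = data["main"][locale]["characters"]["exemplarCharacters"]
    inner = raw.strip()[1:-1]  # drop the surrounding [ and ]
    tokens = re.findall(r"\{([^}]*)\}|(\S)", inner)
    return [sequence or char for sequence, char in tokens]


alphabet = exemplar_characters(SAMPLE, "sl")
```

The resulting list could then feed an alphabet-based filter instead of a hand-maintained character set, provided the parsing is extended to cover the full set syntax used in the real data.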
> https://github.com/facebookresearch/fastText has been archived by the owner on Mar 19, 2024. It is now read-only. `fastText` uses `numpy` ([link](https://github.com/facebookresearch/fastText/blob/1142dc4c4ecbc19cc16eee5cdd28472e689267e6/setup.py#L197)). `numpy` recently updated to `2.0.0`. Also, other dependencies can be...