OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
- List category names
- List all datasets in a category
- Give a concatenated version of all datasets in a category
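The three operations listed above could be sketched roughly as follows. This is only an illustration: the in-memory `CategoryIndex` mapping and the function names `list_categories`, `list_datasets` and `concat_category` are hypothetical and do not reflect how OpusCleaner actually stores or exposes its data.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory index: category name -> dataset name -> lines.
# This layout is an assumption made for the sketch only.
CategoryIndex = Dict[str, Dict[str, List[str]]]


def list_categories(index: CategoryIndex) -> List[str]:
    """Return all category names in sorted order."""
    return sorted(index)


def list_datasets(index: CategoryIndex, category: str) -> List[str]:
    """Return the dataset names in one category (empty if unknown)."""
    return sorted(index.get(category, {}))


def concat_category(index: CategoryIndex, category: str) -> Iterator[str]:
    """Yield the lines of every dataset in a category, dataset by dataset."""
    for name in list_datasets(index, category):
        yield from index[category][name]


example_index: CategoryIndex = {
    "clean": {"europarl": ["a", "b"], "bible": ["c"]},
    "dirty": {"ccmatrix": ["d"]},
}
```

Yielding lines lazily keeps the concatenation usable for datasets that do not fit in memory, assuming the real index reads from files rather than lists.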
Already implemented, but I have doubts about the quality. Note to self: https://github.com/zaboople/klonk/blob/master/TheGURQ.md
Imagine: you can just copy the cleaning configuration files to your directory, or get them from your git repository, and run a command, and that will download the data to...
For easier filter development and in general for power users, one should be able to build the code and run it from the local directory, as opposed to installing it...
Add at least some test cases for things like `opuscleaner-sample` and `opuscleaner-clean`.
Integrate the automated analytics that OpusFilter can generate, along with the notes I took at Prompsit. Domain analytics would be very interesting as well!
A reminder to myself to fix this in the near future: CCMatrix and CCAligned contain a lot of quotes that are cut off arbitrarily on both sides. Sometimes one side will...
Automatically building & pushing a Docker container should make installation a lot easier. Pip wheels would also be nice for HPC + conda environments, I suppose. But Docker containers are...
At the moment categories are hardcoded to `clean`, `medium` and `dirty`. We should be able to set more categories.
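One way this could look is a small loader that falls back to the current three categories when no configuration is given. This is a sketch under assumptions: the JSON shape with a `categories` key, the `DEFAULT_CATEGORIES` constant and the `load_categories` function are all hypothetical, not part of OpusCleaner.

```python
import json
from typing import List, Optional

# The three categories that are currently hardcoded, used as the fallback.
DEFAULT_CATEGORIES = ["clean", "medium", "dirty"]


def load_categories(config_text: Optional[str] = None) -> List[str]:
    """Parse category names from a JSON string such as
    '{"categories": ["clean", "noisy"]}' (hypothetical format).

    Falls back to DEFAULT_CATEGORIES when no names are configured.
    Duplicates are dropped while keeping first-seen order.
    """
    if not config_text:
        return list(DEFAULT_CATEGORIES)

    data = json.loads(config_text)
    if not isinstance(data, dict):
        raise ValueError("category config must be a JSON object")

    names: List[str] = []
    for name in data.get("categories", []):
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"invalid category name: {name!r}")
        if name not in names:
            names.append(name)

    return names or list(DEFAULT_CATEGORIES)
```

Keeping the defaults as a fallback means existing setups would behave the same until someone opts in to custom categories.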
Hello. :) Just some notes; maybe some of this info can be added to the README. I was installing on Ubuntu 20.04 (native Python is 3.8). The install failed with 3.8; I added 3.10...