
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Results 58 OpusCleaner issues

- List category names
- List all datasets in a category
- Give concatenated version of all datasets in a category
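The three operations requested above could be sketched roughly as follows. This is only an illustration and not OpusCleaner's actual API: the `Categories` mapping (category name to a list of plain-text dataset files) and the function names `list_categories`, `list_datasets` and `concat_category` are all assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, Iterator, List

# Hypothetical layout (an assumption, not OpusCleaner's real data model):
# category name -> list of plain-text dataset files, one sentence per line.
Categories = Dict[str, List[Path]]


def list_categories(categories: Categories) -> List[str]:
    """Return all category names in sorted order."""
    return sorted(categories)


def list_datasets(categories: Categories, category: str) -> List[Path]:
    """Return the datasets in one category (empty list if unknown)."""
    return list(categories.get(category, []))


def concat_category(categories: Categories, category: str) -> Iterator[str]:
    """Yield the lines of every dataset in a category, in listing order."""
    for path in list_datasets(categories, category):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                yield line.rstrip("\n")
```

Yielding lines lazily keeps memory use flat even when a category contains large corpora; a caller can write the result to a file or pipe it onward.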

Already implemented, but I have doubts about the quality. Note to self: https://github.com/zaboople/klonk/blob/master/TheGURQ.md

Imagine: you can just copy the cleaning configuration files to your directory, or get them from your git repository, and run a command, and that will download the data to...

enhancement
component:execution

For easier filter development and in general for power users, one should be able to build the code and run it from the local directory, as opposed to installing it...

Add at least some test cases for things like `opuscleaner-sample` and `opuscleaner-clean`.

Integrate the automated analytics that OpusFilter can generate, along with the notes I took at Prompsit. Domain analytics would also be very interesting!

A reminder to myself to fix in the near future: CCMatrix and CCAligned contain a lot of quotes that are cut off arbitrarily on both sides. Sometimes one side will...

enhancement

Automatically building & pushing a Docker container should make installation a lot easier. pip wheels would also be nice for HPC + conda environments, I suppose. But Docker containers are...

enhancement

At the moment categories are hardcoded to `clean`, `medium` and `dirty`. We should be able to set more categories.
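One possible shape for this, as a minimal sketch: keep `clean`, `medium` and `dirty` as defaults and let a user-supplied list override them. The function name `resolve_categories` and the idea of passing the list in from some configuration source are assumptions for illustration, not how OpusCleaner is currently structured.

```python
from typing import Iterable, List, Optional

# The three categories the issue says are currently hardcoded.
DEFAULT_CATEGORIES = ["clean", "medium", "dirty"]


def resolve_categories(configured: Optional[Iterable[str]] = None) -> List[str]:
    """Return the category names to use.

    Hypothetical helper: `configured` stands in for whatever the user
    supplies (config file, CLI flag, ...). Falls back to the defaults
    when nothing usable is given.
    """
    if configured is None:
        return list(DEFAULT_CATEGORIES)
    seen: List[str] = []
    for name in configured:
        name = name.strip()
        # Skip blank entries and duplicates, keeping first-seen order.
        if name and name not in seen:
            seen.append(name)
    return seen or list(DEFAULT_CATEGORIES)
```

Falling back to the defaults when the configured list is empty keeps existing setups working unchanged.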

Hello. :) Just some notes; maybe some of this info can be added to the readme. I was installing on Ubuntu 20.04 (native Python is 3.8). The install failed with 3.8, so I added 3.10...