OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Results 58 OpusCleaner issues

Experimental branch that uses OpusFilter for all of the tools and processing. Ideally we'd add compatibility for our external filters (the .json files) as well because I like that extensibility...

Many filters need to know the languages of the two columns. It could be useful if those can be filled in automatically. The interface knows the language of the two...

enhancement
component:ui

In the category _lessons from Paracrawl_: It is sometimes very useful to split the actual filtering pipeline into a couple of steps that are then executed on different hardware. For...

enhancement
help wanted
component:execution

https://github.com/jelmervdl/empty-train/blob/5edf6e20b7fe381ca87d4abe9aa4fcf1985b63ef/main.py#L31 We should optionally have a list of possible values, or something like a datatype hint (e.g. int).

enhancement
component:ui
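
One way to picture the request is a parameter description that carries an optional cast and an optional list of allowed values. The sketch below is only an illustration: `ParameterSpec`, its fields and `parse` are hypothetical names, not taken from empty-train or OpusCleaner.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class ParameterSpec:
    """Hypothetical parameter description (not the project's actual schema)."""
    name: str
    cast: Callable[[str], Any] = str          # datatype hint, e.g. int
    choices: Optional[Sequence[Any]] = None   # optional list of possible values

    def parse(self, raw: str) -> Any:
        """Convert a raw string to the hinted type and check it against choices."""
        try:
            value = self.cast(raw)
        except ValueError as exc:
            raise ValueError(f"{self.name}: cannot read {raw!r}") from exc
        if self.choices is not None and value not in self.choices:
            raise ValueError(
                f"{self.name}: {value!r} is not one of {list(self.choices)}"
            )
        return value
```

With such a description a user interface could render a number field for `ParameterSpec("epochs", int)` and a drop-down for a parameter that has `choices`.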

The last time I worked with this it was using [OpenCC](https://pypi.org/project/OpenCC/). It is much more up to date and seems to have an active community. Last release from hanziconv is...

component:filter

Some corpora are contaminated with XML/HTML/CSV and other types of artefacts. We should have a category of filters that simply runs `grep -n` over the full dataset for these, and...

enhancement
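
As a rough illustration of the `grep -n` idea, the sketch below reports 1-based line numbers of lines that match a pattern. The function name and the pattern are assumptions made for this example: the regular expression only catches tag-like markup and character entities, and is not meant as a complete artefact detector.

```python
import re
from typing import Iterable, List, Tuple

# Illustrative pattern only: tag-like markup or HTML/XML character entities.
MARKUP_PATTERN = re.compile(r"</?[A-Za-z][^>]*>|&[A-Za-z]+;|&#\d+;")


def find_artifact_lines(
    lines: Iterable[str], pattern: "re.Pattern[str]" = MARKUP_PATTERN
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for matching lines, numbered like grep -n."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


sample = ["Hello world", "<p>Hi</p>", "a &amp; b", "3 < 4 and 5 > 2"]
hits = find_artifact_lines(sample)
```

Here `hits` contains lines 2 and 3; the last line is not flagged because `<` is not followed by a letter.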

In PR #157 I added additional alphabet support. This information is provided by professional translators in the CLDR data: https://github.com/unicode-org/cldr-json/blob/0876ec40e13d54c0a6b6456392802d4de7e059cb/cldr-json/cldr-misc-full/main/sl/characters.json It would be nice to consume that JSON and automate...
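
A minimal sketch of consuming such a file might look like the following. It assumes the `main` → locale → `characters` → `exemplarCharacters` layout, and uses a hand-written sample string in place of the real file. The parser is deliberately simplified: it handles space-separated characters, `{…}` multi-character sequences and simple `a-z` ranges, and ignores escapes and the rest of the UnicodeSet syntax. The function names are hypothetical.

```python
import json
from typing import Set


def parse_exemplar_set(unicode_set: str) -> Set[str]:
    """Parse a simplified subset of UnicodeSet syntax (no escapes, no nesting)."""
    body = unicode_set.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    chars: Set[str] = set()
    for token in body.split():
        if token.startswith("{") and token.endswith("}"):
            chars.add(token[1:-1])  # multi-character sequence kept whole
        elif len(token) == 3 and token[1] == "-":
            # simple range such as a-z, inclusive on both ends
            chars.update(chr(c) for c in range(ord(token[0]), ord(token[2]) + 1))
        else:
            chars.update(token)  # one or more plain characters
    return chars


def alphabet_from_cldr(document: dict, locale: str) -> Set[str]:
    """Read the main exemplar set, assuming the layout described above."""
    return parse_exemplar_set(
        document["main"][locale]["characters"]["exemplarCharacters"]
    )


# Hand-written sample that mimics the assumed layout; not copied from CLDR.
SAMPLE = json.loads(
    '{"main": {"sl": {"characters": {"exemplarCharacters": '
    '"[a b c č d e f g h i j k l m n o p r s š t u v z ž]"}}}}'
)
alphabet = alphabet_from_cldr(SAMPLE, "sl")
```

The resulting set could then feed an alphabet filter instead of a hand-maintained character list.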

> https://github.com/facebookresearch/fastText has been archived by the owner on Mar 19, 2024. It is now read-only. `fastText` uses `numpy` ([link](https://github.com/facebookresearch/fastText/blob/1142dc4c4ecbc19cc16eee5cdd28472e689267e6/setup.py#L197)). `numpy` recently updated to `2.0.0`. Also, other dependencies can be...