
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

58 OpusCleaner issues

Installation on Python 3.10 fails because the specified version of OpusFilter (2.6.0) requires fast-mosestokenizer, and there is no version of it for Python 3.10. I do not know if a later version...

bug
help wanted
good first issue

Filtering fails on some datasets, for example, en-ru OPUS XLEnt:
```
[task 2024-04-17T19:48:57.880Z] [11/12:laser_similarity] Traceback (most recent call last):
[task 2024-04-17T19:48:57.881Z] [11/12:laser_similarity]   File "/builds/worker/.local/lib/python3.10/site-packages/opuscleaner/filters/../threshold.py", line 142, in wrapper
[task 2024-04-17T19:48:57.881Z]...
```

FastText model downloading fails quite often, especially when using the "large" model. A workaround is to pre-download the model with wget:
```
filters_dir="/builds/worker/.local/lib/python3.10/site-packages/opuscleaner/filters"
wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
```
I think...

The current package `laserembeddings` is quite old and uses LASER 1. The new official package `laser_encoders` [supports the latest models and more languages](https://github.com/facebookresearch/LASER/tree/main/laser_encoders#laser-versions-and-associated-packages). It is slower to run though, so...

Enhance the user experience by telling users which filter is responsible for each change. A tooltip shown when hovering over a change would suffice.

enhancement

"GET /languages/ HTTP/1.1" 500 Internal Server Error ERROR: Exception in ASGI application File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1351, in do_open raise URLError(err) urllib.error.URLError: Please advise

Hi team, is there a way to use one universal filter JSON file to run on all datasets in the train-parts directory? Right now, it seems...

enhancement
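Until such an option exists, one possible workaround is to broadcast a single shared pipeline file across the datasets before cleaning. The sketch below is a hypothetical helper, not part of OpusCleaner; it assumes datasets live as `<name>.<lang>.gz` pairs under `train-parts` and that `opuscleaner-clean` is then run per `<name>.filters.json`, matching the invocation shown in the bug report further down.

```python
import shutil
from pathlib import Path

def broadcast_filters(universal: Path, train_parts: Path) -> list[Path]:
    """Copy one shared filter pipeline next to every dataset in train-parts.

    Hypothetical workaround: assumes each dataset is a <name>.<lang>.gz pair
    and that opuscleaner-clean is invoked with <name>.filters.json.
    """
    written = []
    for data_file in sorted(train_parts.glob("*.gz")):
        name = data_file.name.rsplit(".", 2)[0]  # strip ".<lang>.gz"
        target = train_parts / f"{name}.filters.json"
        if not target.exists():  # both sides of a pair map to one target
            shutil.copy(universal, target)
            written.append(target)
    return written
```

After broadcasting, each `<name>.filters.json` can be passed to `opuscleaner-clean` in a loop.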

Something went wrong with fastText model downloading and opuscleaner-clean did not exit. It was killed on task timeout after 24 hours: https://firefox-ci-tc.services.mozilla.com/tasks/B5S8nc1OTI6hOxCkGDgG9A/runs/0/logs/public/logs/live.log I could successfully rerun the exact same task....
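The 24-hour hang suggests the download has no socket timeout or retry bound. A minimal standard-library sketch of such a guard (function name and parameters are hypothetical, not OpusCleaner's actual code) would fail fast instead of blocking until the external task timeout:

```python
import socket
import time
import urllib.error
import urllib.request

def download_with_retry(url: str, dest: str, attempts: int = 3,
                        timeout: float = 60.0, backoff: float = 2.0) -> None:
    """Fetch url to dest with a per-attempt socket timeout and bounded retries.

    Hypothetical guard: a stalled connection raises after `timeout` seconds
    rather than hanging; transient failures are retried up to `attempts` times.
    """
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                 open(dest, "wb") as out:
                while chunk := resp.read(1 << 20):  # stream in 1 MiB chunks
                    out.write(chunk)
            return
        except (urllib.error.URLError, socket.timeout, OSError):
            if attempt == attempts:
                raise  # give up loudly instead of hanging silently
            time.sleep(backoff * attempt)
```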

In practice, I would have big, noisy training data and a sample of clean data that is representative of the downstream task (e.g. WMT validation sets). It is still difficult for me...

Running `opuscleaner-clean` (using Python 3.8.18) immediately fails:
```
(opuscleaner) [lofn]bhaddow: opuscleaner-clean data/train-parts/HPLT-v1.1.bs-en.filters.json
Traceback (most recent call last):
  File "/home/shared/bhaddow/anaconda3/envs/opuscleaner/bin/opuscleaner-clean", line 5, in <module>
    from opuscleaner.clean import main
  File "/mnt/startiger0/saga/raid0/bhaddow/code/OpusCleaner/opuscleaner/clean.py", line 25,...
```

bug