Jaume Zaragoza
I tested this by deleting the fastText model from the filesystem: Bicleaner AI triggers the download only when hardrules is enabled. This probably happened when translations still had hardrules enabled. So...
This is probably failing because it is using the binary shortlist parameter with a shortlist in text format ``` ... --shortlist /builds/worker/fetches/lex.s2t.pruned false ... ``` and it will be fixed by #1169
So, regarding mono HPLT, this seems to be the cleaning summary:
```
22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG
```
things that I think may...
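To compare the labels in the summary above at a glance, the counts can be converted into shares of the total. This is only a rough sketch: it assumes each line is a `count label` pair and that every count is the number of lines tagged with that label, which the summary itself does not state. `parse_summary` and `shares` are hypothetical helper names, not part of any existing tool.

```python
# Sketch: turn a "count label" cleaning summary into per-label shares.
# Assumption: one "count label" pair per line, counts are line totals.

SUMMARY = """\
22735836 HPLT_LID_SEGMENT
29798622 DUPLICATE
2028464 CLEAN_LID
3920256 RATIO_ALPHA
3683762 RATIO_CHARS
11074337 TOO_LONG
"""


def parse_summary(text: str) -> dict[str, int]:
    """Parse 'count label' lines into a {label: count} mapping."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        count, label = line.split(maxsplit=1)
        counts[label.strip()] = int(count)
    return counts


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's fraction of the summed counts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    counts = parse_summary(SUMMARY)
    # Print labels from largest to smallest share.
    for label, frac in sorted(shares(counts).items(), key=lambda kv: -kv[1]):
        print(f"{label:<20} {counts[label]:>12,} {frac:6.1%}")
```

Sorting by share puts the labels that account for most of the counted lines first, which is usually where to start looking.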
Ok, so I took a look at parallel and it's a bit difficult to tell what's happening with the cleaning without debug logs for the filters. Probably it was in part...
I don't have a list. There is an extensive list here: https://opus.nlpl.eu/mt/release-history, but that does not give us the "best" model.
oh, sorry, I didn't remember this was due to the licenses. Could we somehow add a link to the dataset viewer? @evgenyrp
Maybe even a link to the exact row in the viewer? https://huggingface.co/datasets/facebook/bouquet/viewer/spa_Latn?views[]=spa_latn_dev&row=2
I opened the issue for the record, but it seems that the dash issues are simply frequent on old models that also have the short-sentence issues, so I'm closing it for now.
Throughput has gone up from 80k tok/s to 130k tok/s for teacher training.
Maybe something like [this tee](https://github.com/hplt-project/OpusCleaner/blob/5fef45344decce1275b3c1a60ec09cda61d478a4/opuscleaner/clean.py#L313) that counts lines at the beginning and at the end of each step. Or enable that tee option and then count the size of each step's output.
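The idea of counting lines before and after each step could look roughly like the sketch below. It is not the OpusCleaner implementation: `LineCounter`, `run_with_counts`, and the two example filters are hypothetical names used only for illustration, and steps are assumed to be plain functions that take an iterable of lines and yield the lines they keep.

```python
# Sketch of a counting "tee": wrap the stream between steps and count lines.
# Assumption: each step is a function from an iterable of lines to an iterator.
from typing import Callable, Iterable, Iterator

Step = Callable[[Iterable[str]], Iterator[str]]


class LineCounter:
    """Pass lines through unchanged while counting them."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines
        self.count = 0

    def __iter__(self) -> Iterator[str]:
        for line in self.lines:
            self.count += 1
            yield line


def run_with_counts(
    lines: Iterable[str], steps: list[tuple[str, Step]]
) -> tuple[list[str], list[tuple[str, int, int]]]:
    """Run the steps in order; return the output and (name, n_in, n_out) per step."""
    prev = LineCounter(lines)
    taps = []
    for name, step in steps:
        out = LineCounter(step(prev))
        taps.append((name, prev, out))
        prev = out  # the output counter of one step is the input of the next
    result = list(prev)  # consume the lazy pipeline so the counters fill in
    report = [(name, c_in.count, c_out.count) for name, c_in, c_out in taps]
    return result, report


def drop_empty(lines: Iterable[str]) -> Iterator[str]:
    """Example filter: drop blank lines."""
    return (line for line in lines if line.strip())


def dedupe(lines: Iterable[str]) -> Iterator[str]:
    """Example filter: keep only the first occurrence of each line."""
    seen: set[str] = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    data = ["a", "", "b", "a"]
    _, report = run_with_counts(data, [("drop_empty", drop_empty), ("dedupe", dedupe)])
    for name, n_in, n_out in report:
        print(f"{name}: {n_in} in, {n_out} out, {n_in - n_out} removed")
```

Because the counters sit between steps, each step's removed-line count comes from the difference between two adjacent counters, without writing intermediate files.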