ValueError: large.bin has wrong file format!
Downloading the FastText model fails quite often, especially when using the "large" model.
A workaround is to pre-download the model with wget:
filters_dir="/builds/worker/.local/lib/python3.10/site-packages/opuscleaner/filters"
wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
I think requests.get is not robust enough without retries, so it fails periodically, whereas wget does a lot more to ensure a reliable download.
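For illustration, here is a minimal sketch of what retrying the download could look like. It is not the code in opuscleaner: the function name fetch_with_retries, its parameters, and the exponential backoff schedule are all assumptions made for this example, and it uses the standard library's urllib instead of requests so that it stays self-contained.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=5, backoff=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Hypothetical helper: return the body of `url`, retrying on network errors.

    `opener` and `sleep` are injectable only so the function can be
    exercised without network access or real delays.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(backoff * 2 ** (attempt - 1))
```

This version reads the whole response into memory, which is wasteful for a model file of this size; a real fix would stream to disk. It also does nothing about several processes writing the same file at once.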
I guess this was because opuscleaner in filter mode runs several processes at the same time that all write to the same file, and the ones that start later do not download it and just try to use it while the first one is still downloading. I haven't had any issues with this while using it with the interface because it's only one process, I think. Not sure how it could be easily fixed.
You could change https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/fasttext_filter.py to download to a temporary file and then os.rename it to its final filename when finished. That way the file is either in place and complete, or not there yet and still to be downloaded. If another process has already put it there by the time you call os.rename, it should be safe to ignore the OSError that rename raises. You do want to check that the OSError is not a permission error, though; those shouldn't be silently ignored, I suppose.
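A rough sketch of that idea, under some assumptions: download_atomically and its fetch parameter are names invented here, not anything in fasttext_filter.py, and urllib stands in for requests.get to keep the example self-contained. Note that on POSIX systems os.rename silently replaces an existing destination, so the "already exists" error only shows up on Windows, where it is a FileExistsError. Catching just that subclass lets a PermissionError propagate instead of being swallowed.

```python
import os
import tempfile
import urllib.request


def download_atomically(url, dest, fetch=urllib.request.urlopen):
    """Hypothetical helper: download `url` to `dest` via a temporary file.

    Other processes see either no file at `dest` or a complete one.
    `fetch` is injectable only so the function can be tried offline.
    """
    if os.path.exists(dest):
        return dest  # an earlier process already finished the download

    # Create the temporary file next to `dest` so the rename stays on one
    # filesystem, which is what makes it atomic on POSIX.
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, fetch(url) as response:
            while chunk := response.read(1 << 20):
                tmp.write(chunk)
        try:
            os.rename(tmp_path, dest)
        except FileExistsError:
            # Windows only: another process renamed its copy first.
            # PermissionError is a different OSError subclass and still raises.
            pass
    finally:
        # Remove the temporary file if it was not renamed into place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return dest
```

With this approach several processes may still download the model in parallel the first time, each into its own temporary file, but none of them can ever load a half-written large.bin.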