firefox-translations-training
Download jobs fail as we hit statmt rate limit
A lot of the parallel corpora are located on statmt.org. As far as I can tell, the download-data step of the Snakemake pipeline executes all corpus downloads in parallel, which unfortunately means that some of them fail due to the rate limit.
Looking at the Snakefile: https://github.com/mozilla/firefox-translations-training/blob/ddba8ebf2558a75a77817c129fd24b5db3ec2987/Snakefile#L259 it seems that the corpus download is a one-thread job. How does it get parallelised, and how can I restrict the parallelisation?
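For reference, the rule in question looks roughly like the simplified sketch below (rule, path, and script names here are illustrative, not copied verbatim from the repository). Declaring threads: 1 only says how many cores a single job reserves; Snakemake will still start as many independent download jobs as fit under --cores, one per dataset.

```python
# Simplified, illustrative Snakemake rule; names and paths are hypothetical.
rule download_corpus:
    message: "Downloading parallel corpus {wildcards.dataset}"
    threads: 1                          # cores reserved by one download job
    output: "data/original/corpus/{dataset}.gz"
    shell: "bash pipeline/data/download-corpus.sh {wildcards.dataset} {output}"

# With `snakemake --cores 64`, up to 64 such one-thread jobs can run at the
# same time, which is what hits the statmt.org rate limit.
```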
As a temporary workaround you can either increase retries (--restart-times 3), increase the number of threads for this job, or both. Parallelization is controlled with the --cores all option and the threads rule parameter, so if you specify 16 threads on a 64-core machine, only 4 of these jobs will run at the same time.
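For illustration (the numbers, rule name, and script path are just an example of the mechanism, not the exact contents of the Snakefile):

```python
# Illustrative workaround: declare more threads for the download rule so that
# fewer copies of it can be scheduled at once. The download script itself
# stays single-threaded; the extra threads merely reserve cores.
rule download_corpus:
    threads: 16                         # 64 cores / 16 threads => at most 4 concurrent downloads
    output: "data/original/corpus/{dataset}.gz"
    shell: "bash pipeline/data/download-corpus.sh {wildcards.dataset} {output}"

# Invocation with the flags mentioned above:
#   snakemake --cores all --restart-times 3 ...
```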
A permanent fix would require introducing some artificial random delay in the downloader. Do you know what the rate limit for statmt is?
@kpu what's our rate limit?
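A minimal sketch of what such a delay could look like (purely illustrative; the actual downloader in the pipeline is a shell script, and the function name and use of wget here are assumptions):

```python
import random
import subprocess
import time

def download_with_delay(url: str, dest: str, max_delay: float = 30.0) -> None:
    """Sleep a random amount before downloading so that concurrent jobs
    do not all hit statmt.org at the same instant."""
    time.sleep(random.uniform(0.0, max_delay))
    subprocess.run(["wget", "-O", dest, url], check=True)
```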
Also, I noticed that if some jobs fail to download, the execution of the pipeline stops, but running make run-local again doesn't retry those failed jobs and instead moves on to cleaning. How can I force it to retry the failed download jobs?
<Location /ngrams>
MaxConnPerIP 4
</Location>
<Location /cc-100>
MaxConnPerIP 1
</Location>
<Location /cc-aligned>
MaxConnPerIP 1
</Location>
<Location /cc-english>
MaxConnPerIP 1
</Location>
<IfModule mod_ratelimit.c>
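# mod_ratelimit throttles response bandwidth; rate-limit values are in KiB/s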
<Location /ngrams>
SetOutputFilter RATE_LIMIT
# SetEnv rate-limit 50000
SetEnv rate-limit 1000
</Location>
<Location /cc-100>
SetOutputFilter RATE_LIMIT
SetEnv rate-limit 1000
</Location>
<Location /cc-aligned>
SetOutputFilter RATE_LIMIT
SetEnv rate-limit 1000
</Location>
<Location /cc-english>
SetOutputFilter RATE_LIMIT
SetEnv rate-limit 1000
</Location>
</IfModule>
I want to raise these. I'm planning to update to a newer server, but that will take time.
Ideally we want the downloader to run in a single thread and the rest of the jobs to use the multithreaded configuration. At the moment I'm running the whole pipeline in one thread to make sure all downloads succeed, which is horrible, because the cleaning steps are interleaved with the downloading steps and therefore also run in a single thread.
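One way to express that in Snakemake (a sketch of the mechanism, not something the pipeline currently does) is a custom resource that serializes downloads while leaving every other rule free to use all cores:

```python
# Sketch: cap concurrent downloads with a custom Snakemake resource.
rule download_corpus:
    threads: 1
    resources: downloads=1              # each download job takes one "download slot"
    output: "data/original/corpus/{dataset}.gz"
    shell: "bash pipeline/data/download-corpus.sh {wildcards.dataset} {output}"

# Run with a single download slot, so downloads happen one at a time while
# cleaning and other jobs still use all available cores:
#   snakemake --cores all --resources downloads=1 ...
```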
This doesn't affect Taskcluster anymore due to our internal caching. Feel free to re-open if anyone hits this again in Snakemake.