Download jobs fail as we hit statmt rate limit

Open XapaJIaMnu opened this issue 3 years ago • 4 comments

A lot of the parallel corpora are hosted on statmt.org. As far as I can tell, the download-data step of the Snakemake pipeline runs all corpus downloads in parallel, which unfortunately means that some of them fail because of the rate limit.

Looking at the Snakefile: https://github.com/mozilla/firefox-translations-training/blob/ddba8ebf2558a75a77817c129fd24b5db3ec2987/Snakefile#L259 it seems that the corpus download is a single-threaded job. How does it get parallelised? How can I restrict the parallelisation?

XapaJIaMnu, Jan 27 '22 19:01

As a temporary workaround, you can increase the number of retries (--restart-times 3), increase the number of threads for this job, or both. Parallelization is controlled by the --cores all option and the threads parameter of each rule. So if you specify 16 threads on a 64-core machine, only 4 jobs will run at the same time.
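The arithmetic behind that last sentence can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described above, not Snakemake's scheduler; the function name `max_concurrent_jobs` is hypothetical.

```python
def max_concurrent_jobs(total_cores: int, threads_per_job: int) -> int:
    """Upper bound on how many copies of one rule can run at the same time."""
    if total_cores < 1 or threads_per_job < 1:
        raise ValueError("cores and threads must be positive")
    # A job cannot reserve more threads than there are cores available.
    effective_threads = min(threads_per_job, total_cores)
    return total_cores // effective_threads


print(max_concurrent_jobs(64, 16))  # 4
print(max_concurrent_jobs(64, 1))   # 64
```

With one thread per download job, every core can be occupied by a download, which is why the single-threaded rule ends up hitting the server with many simultaneous connections.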

A permanent fix would require introducing an artificial random delay in the downloader. Do you know what the rate limit for statmt is?
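A minimal sketch of what such a random delay might look like, assuming the downloader is a Python callable. The names `jittered_delay`, `download_with_retries` and the `fetch` parameter are hypothetical and not part of the pipeline; the strategy shown is exponential backoff with full jitter, which is one common choice rather than the project's actual fix.

```python
import random
import time


def jittered_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before attempt number `attempt` (0-based).

    Returns a random value between 0 and min(cap, base * 2**attempt).
    """
    return rng() * min(cap, base * (2 ** attempt))


def download_with_retries(fetch, url, attempts=4, sleep=time.sleep, rng=random.random):
    """Call fetch(url), waiting a random delay before every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        # The random wait spreads out parallel jobs so they do not all
        # open a connection at the same moment.
        sleep(jittered_delay(attempt, rng=rng))
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
    raise last_error
```

The `sleep` and `rng` parameters exist only so the behaviour can be checked without real waiting; in normal use the defaults would be left alone.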

eu9ene, Jan 27 '22 20:01

@kpu what's our rate limit?

Also, I noticed that if some download jobs fail, the pipeline stops, but running make run-local again doesn't retry the failed jobs and instead moves on to cleaning. How can I force it to retry the failed download jobs?

XapaJIaMnu, Jan 27 '22 21:01

<Location /ngrams>
   MaxConnPerIP 4
</Location>
<Location /cc-100>
   MaxConnPerIP 1
</Location>
<Location /cc-aligned>
   MaxConnPerIP 1
</Location>
<Location /cc-english>
   MaxConnPerIP 1
</Location>

<IfModule mod_ratelimit.c>
  <Location /ngrams>
    SetOutputFilter RATE_LIMIT
    # SetEnv rate-limit 50000
    SetEnv rate-limit 1000
  </Location>
  <Location /cc-100>
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 1000
  </Location>
  <Location /cc-aligned>
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 1000
  </Location>
  <Location /cc-english>
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 1000
  </Location>
</IfModule>

I want to raise these limits. I'm planning to move to a newer server, but that will take time.
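Until the limits are raised, a client could mirror the MaxConnPerIP values from the configuration above. The following is a sketch under that assumption: the helper names and the example.org URLs are hypothetical, and `fetch` stands in for whatever function performs the download.

```python
import threading
from urllib.parse import urlparse

# Per-IP connection limits copied from the server configuration above.
MAX_CONN_PER_IP = {
    "/ngrams": 4,
    "/cc-100": 1,
    "/cc-aligned": 1,
    "/cc-english": 1,
}

# One semaphore per limited location, sized to the server-side limit.
_SEMAPHORES = {
    prefix: threading.BoundedSemaphore(limit)
    for prefix, limit in MAX_CONN_PER_IP.items()
}


def limited_prefix(url):
    """Return the limited location that `url` falls under, or None."""
    path = urlparse(url).path
    for prefix in MAX_CONN_PER_IP:
        if path == prefix or path.startswith(prefix + "/"):
            return prefix
    return None


def download_respecting_limits(fetch, url):
    """Run fetch(url) while holding a connection slot for its location."""
    prefix = limited_prefix(url)
    if prefix is None:
        return fetch(url)  # no limit is listed for this path
    with _SEMAPHORES[prefix]:
        return fetch(url)
```

This only coordinates threads inside one process. Separate processes started by a workflow manager would each get their own semaphores, so the limit would have to be enforced at the scheduler level instead.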

kpu, Jan 27 '22 21:01

Ideally we want the downloader to run in a single thread and the rest of the jobs to use the multithreaded configuration. At the moment I'm running the whole pipeline in one thread to make sure all downloads succeed, which is horrible: the cleaning steps are interleaved with the downloading steps, so they also run in a single thread.
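The desired behaviour, serialised downloads with everything else running in parallel, can be illustrated with a small Python sketch. It is not the pipeline's code: `SlotLimiter`, `fake_download`, `fake_clean` and `run_pipeline` are hypothetical stand-ins. In Snakemake itself the same idea is usually expressed by giving the download rule a custom `resources:` entry and capping it with `--resources`, though whether that fits this pipeline is untested here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SlotLimiter:
    """Caps how many jobs of one kind run at once and records the peak."""

    def __init__(self, slots):
        self._semaphore = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def run(self, job, *args):
        with self._semaphore:
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_download(corpus):
    time.sleep(0.01)  # stand-in for a network transfer
    return f"downloaded {corpus}"


def fake_clean(corpus):
    time.sleep(0.01)  # stand-in for a cleaning step
    return f"cleaned {corpus}"


def run_pipeline(corpora, workers=8):
    downloads = SlotLimiter(slots=1)  # only one download at a time

    def process(corpus):
        downloads.run(fake_download, corpus)  # waits for the download slot
        return fake_clean(corpus)  # not limited, overlaps with other corpora

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, corpora))
    return results, downloads.peak
```

Calling `run_pipeline(["a", "b", "c"])` returns the cleaned results in input order together with the peak number of simultaneous downloads, which stays at 1 even though eight workers are available.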

XapaJIaMnu, Jan 28 '22 14:01

This doesn't affect Taskcluster anymore due to our internal caching. Feel free to re-open if anyone hits this again in Snakemake.

gregtatum, Apr 09 '24 21:04