
Run in cluster mode under a limited network environment

Open r00t1900 opened this issue 3 years ago • 3 comments

description

Syncing with PyPI using bandersnatch on a single server can be fragile and slow. For example, on an ordinary network (nothing CDN-grade) with an average download speed of 5 MB/s, a full sync can take up to 30 days. Factor in the load on https://pypi.org and the effective speed can drop to 2 MB/s, stretching an initial full sync to 75 days, which is hard to accept. Something needs to be done to avoid this.
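The arithmetic behind those estimates can be sanity-checked with a quick sketch, assuming a mirror size of roughly 13 TB (the size the figures above imply; the real PyPI mirror keeps growing):

```python
# Back-of-the-envelope full-sync times for an assumed ~13 TB mirror.
MIRROR_BYTES = 13e12

def sync_days(speed_mb_per_s: float) -> float:
    """Days needed to download the whole mirror at a given average speed."""
    return MIRROR_BYTES / (speed_mb_per_s * 1e6) / 86400

print(round(sync_days(5)))  # 30 days at 5 MB/s
print(round(sync_days(2)))  # 75 days at 2 MB/s
```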

idea

So I am wondering whether we could run bandersnatch in a cluster mode:

  • place several servers on different networks, so that any one site's bandwidth limit no longer matters.
  • have each server download a slice of the data from PyPI.
  • when all servers are done, transfer the slices to one server and combine them into a complete mirror, over the local network (up to 1 Gbps) or by shipping hard drives.
  • with a complete copy of the data, run bandersnatch's final step, "generate global index page". This would let us break our network limit and finish much faster.
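The combine step could be sketched like this (paths and shard layout are hypothetical, and bandersnatch itself has no built-in merge command; this just copies one server's partial mirror tree into the combined one):

```python
import pathlib
import shutil

def merge_shard(shard_root: str, mirror_root: str) -> None:
    """Copy every file from one server's partial mirror into the combined tree.

    Hypothetical helper: in practice each shard would arrive over the LAN
    or on a shipped hard drive before being merged like this.
    """
    shard = pathlib.Path(shard_root)
    mirror = pathlib.Path(mirror_root)
    for src in shard.rglob("*"):
        if src.is_file():
            dst = mirror / src.relative_to(shard)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy file, preserving timestamps

# e.g. merge_shard("/data/shard1", "/srv/pypi")  # paths are illustrative
```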

more

In our real-world case, the network is capped at 5 MB/s. We are looking for a way around that limit, which led to the cluster idea. Can bandersnatch do this through configuration alone, or in cooperation with other software?

r00t1900 avatar Mar 20 '22 15:03 r00t1900

Hi there,

Maybe a simpler starting point is to make sure you're running with the maximum of 9 workers in bandersnatch.conf? If so, does that still cap you at 5 MB/s?

workers = 3
  • https://bandersnatch.readthedocs.io/en/latest/mirror_configuration.html#workers
    • e.g. https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/tests/ci.conf#L10
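For reference, the setting lives in the [mirror] section of bandersnatch.conf; the value below is illustrative:

```ini
[mirror]
; Number of download worker threads. Raising this helps only until
; the link itself is saturated.
workers = 9
```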

That said, another simple approach I suggest we could possibly start with is:

  • add a generate_global_index bool to the config
  • with this set to False, bandersnatch wouldn't write out the global HTML file after a sync
  • then you could use filters to start as many bandersnatch instances as you want, each syncing its 'shard' / 'partition' of packages

Once they have all done their partial syncs, you could run a central full sync to generate the main index.html. Open to other ideas here too. Feel free to share.
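A per-instance shard config along those lines might look like this, using the existing allowlist_project filter plugin (the package names are just placeholders, and the generate_global_index option discussed above does not exist yet):

```ini
; One bandersnatch instance's config: sync only this shard of packages.
; Other instances would each get a different package list.
[plugins]
enabled =
    allowlist_project

[allowlist]
packages =
    numpy
    scipy
    pandas
```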

cooperlees avatar Mar 20 '22 18:03 cooperlees

My network maxes out at 5 MB/s, no matter how many workers I set on a single local instance. But I can get 5 MB/s at the company and another 5 MB/s at home, and I would like to make use of both.

r00t1900 avatar Mar 22 '22 14:03 r00t1900

"add a generate_global_index bool in config"

I can not find this in the documentation — where should I add it? I added it to the [mirror] section but nothing changed. I really need a download-only mode that skips "generate global index page", because that step errors out and takes too much time.

r00t1900 avatar Apr 07 '22 00:04 r00t1900