bandersnatch
Run in cluster mode under limited network environment
description
Here is the situation: syncing with PyPI using bandersnatch on a single server is fragile and slow. On an ordinary (non-CDN-grade) network with an average download speed of 5 MB/s, a full sync can take up to 30 days. If load on https://pypi.org pushes the speed down to 2 MB/s, an initial full sync would take about 75 days, which is unacceptable. Something needs to be done to avoid this.
idea
So I am wondering if we can run bandersnatch in cluster mode:
- Place several servers on different networks, so that any one local bandwidth limit no longer matters.
- Each server downloads one slice of the data from PyPI.
- When all servers are done, transfer the pieces to one server and combine them into a complete mirror, either over the local area network (up to 1 Gbps) or by shipping hard drives.
- With a complete copy of the data, run bandersnatch's final step, "generate global index page". This would break our network limit and make the sync much quicker.
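The merge step described above can be sketched with plain file copies. This is only an illustration with made-up local paths; in practice each shard would live on a different server, and you would use `rsync -a` over the LAN (or ship drives) instead of a local `cp`:

```shell
# Sketch only: demonstrate merging two mirror shards into one tree.
# The directory layout imitates bandersnatch's web/ tree, but all
# names and paths here are hypothetical.
set -eu

mkdir -p shard-a/web/packages shard-b/web/packages merged/web
printf 'A' > shard-a/web/packages/pkg-a-1.0.tar.gz
printf 'B' > shard-b/web/packages/pkg-b-1.0.tar.gz

# Copy each shard's tree into the combined mirror directory.
# Over a network, something like
#   rsync -a shard-host:/srv/pypi/web/ /srv/pypi/web/
# would do the same union without re-copying unchanged files.
cp -a shard-a/web/. merged/web/
cp -a shard-b/web/. merged/web/

# merged/web/packages now holds the union of both shards' files
ls merged/web/packages
```

Because each shard downloads a disjoint set of packages, the copies never conflict, and the final index generation can then run against the merged tree.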
more
In our real case, the network is capped at 5 MB/s. We are looking for a way to break that limit, which led to the idea of a cluster. Can bandersnatch do this with the right configuration, or in cooperation with other software?
Hi there,
Maybe a simpler starting point is to ensure you're using workers at the maximum (9) in bandersnatch.conf? If so, is that still capping you at 5 MB/s?
workers = 3
- https://bandersnatch.readthedocs.io/en/latest/mirror_configuration.html#workers
- e.g. https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/tests/ci.conf#L10
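For reference, the `workers` setting lives in the `[mirror]` section of bandersnatch.conf. A minimal sketch (the paths here are illustrative, not a recommendation):

```ini
[mirror]
; where the mirror is written on disk (illustrative path)
directory = /srv/pypi
; upstream index to sync from
master = https://pypi.org
; number of parallel download threads; raising this only helps
; until the link itself is saturated
workers = 9
```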
That said, another simple approach we could possibly start with is:
- Add a `generate_global_index` bool to the config. With this option you could skip writing out the global HTML file after a sync.
- Then you could use filters to start as many bandersnatch instances as you want, each syncing its own 'shard' / 'partition' of the packages.
- The Regex Filter could be one plugin used to do this.

Once they are all doing partial syncs, you could run a central full sync to generate the main index.html. Open to other ideas here too. Feel free to share.
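As an illustration of the sharding idea, each instance could get its own config whose regex filter excludes the packages the other instances handle. This is only a sketch: it assumes the `regex_project` filter plugin and its `[filter_regex]` section act as a blocklist (projects matching the pattern are skipped), so double-check the filtering docs for your bandersnatch version:

```ini
; Instance A (e.g. at the office): skip projects starting n-z,
; so this shard only downloads projects starting a-m.
; Instance B would invert the pattern (^[a-m].*) to cover the rest.
[plugins]
enabled =
    regex_project

[filter_regex]
packages =
    ^[n-z].*
```

With the project namespace split this way, the two shards download disjoint sets of packages and can later be merged without conflicts.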
My network maximum is 5 MB/s, no matter how many workers I set on a single local instance. But I can get 5 MB/s at the company and another 5 MB/s at home, and I would like to make use of both.
> add a `generate_global_index` bool in config

I cannot find this in the documentation; where should I add it? I added it to the `[mirror]` section, but nothing worked. I really need to download only, without performing "generate global index page", because errors happen there and it takes too long.