chess-alpha-zero
Distributed version
The distributed version of this project is ready to use, but we need to find an FTP server on the internet that lets us upload and download files larger than 30MB, so that we can store the best model configuration and its weights.
I signed up for a free hosting service with FTP support, but I just realized it is limited to files up to 16MB, so we cannot use it for large model configurations (the weights file will be larger than 30MB).
If somebody wants to help with this, just replace the FTP credential lines in
config.py
with working credentials, and I will merge the change as soon as you do.
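For anyone picking this up, the upload side could look roughly like the sketch below. This is only an assumption about how credentials from config.py would be used: the names FTP_HOST, FTP_USER, FTP_PASS and the remote filename are hypothetical, and the `ftp_cls` parameter exists only so the function can be exercised without a real server.

```python
from ftplib import FTP

# Hypothetical values standing in for the credential lines in config.py.
FTP_HOST = "ftp.example.com"
FTP_USER = "user"
FTP_PASS = "password"

def upload_weights(local_path, remote_name,
                   host=FTP_HOST, user=FTP_USER, password=FTP_PASS,
                   ftp_cls=FTP):
    """Upload a weights file over FTP.

    ftp_cls is injectable so the logic can be tested with a fake
    FTP class instead of a live connection.
    """
    ftp = ftp_cls(host)          # connect to the server
    ftp.login(user, password)    # authenticate with the config credentials
    with open(local_path, "rb") as f:
        # STOR transfers the file in binary mode to the remote name
        ftp.storbinary(f"STOR {remote_name}", f)
    ftp.quit()
```

The download side would be the mirror image with `retrbinary`.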
Regards.
30MB after compression? How much storage space/bandwidth are you looking at in total?
@prusswan, for the "distributed.py" configuration the model weights file is about 30MB. I think 35MB would be enough.
How about the bandwidth? (how often the files need to be uploaded/downloaded)
More or less it would be: 30MB x an average of 1 read/write per minute x the number of users participating.
20 users training would need a daily bandwidth of almost 1 TB. It would probably require an FTP server running on a local (private) machine, opened up as a server for the rest of the users. I don't know whether a good public (paid) server is going to serve 30TB of FTP bandwidth per month.
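The arithmetic above can be checked with a quick back-of-the-envelope script (the parameters are the ones quoted in this thread, not measured values):

```python
# Back-of-the-envelope check of the bandwidth estimate in this thread.
WEIGHTS_MB = 30          # size of the weights file, as stated above
TRANSFERS_PER_MIN = 1    # average reads/writes per user per minute
USERS = 20               # number of participating users

# MB per day across all users, converted to GB (1000 MB = 1 GB)
daily_gb = WEIGHTS_MB * TRANSFERS_PER_MIN * 60 * 24 * USERS / 1000
# GB per 30-day month, converted to TB
monthly_tb = daily_gb * 30 / 1000

print(daily_gb)    # 864.0 GB/day, i.e. "almost 1 TB"
print(monthly_tb)  # 25.92 TB/month, close to the 30TB figure above
```

So both figures in the thread hold up: roughly 0.86 TB per day and about 26 TB per month for 20 users.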
Looks like bandwidth (due to data size and frequency) will be the bottleneck. Could the frequency be reduced further? If not, you are probably looking at a resource from some school/institution.
Also, in addition to bandwidth, there are code-related considerations to make the distributed model more successful. I'm afraid the amount of help this project gets will be significantly lowered by the lack of portability (uvloop doesn't have a Windows port, so Windows users will simply not be able to help...)
Changes to "conform" to AlphaZero's way, such as updating the network all the time instead of having a separate evaluation mechanism, have been briefly discussed in the related Go project, Leela Zero. One idea for the future is that for distributed work there won't be a global "best net" update at every step; instead, that update can happen periodically, rather than doing exactly what AlphaZero did, since we have such a different working environment.
The best approach seems to be gradual changes in that direction over time (slowly...) to see what can still work for this distributed model.
Bandwidth doesn't come anywhere near the numbers you stated when you don't need to upload/download a new net all the time.
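The periodic-update idea could be sketched as below. SYNC_EVERY is a made-up parameter, not something from the project; the point is only that syncing every N games instead of every game divides the transfer volume by N.

```python
SYNC_EVERY = 60  # games between net downloads (hypothetical value)

def should_sync(games_played, sync_every=SYNC_EVERY):
    """Return True when a worker should fetch the latest net.

    Instead of downloading the "best net" after every game, a worker
    only syncs once every `sync_every` games.
    """
    return games_played > 0 and games_played % sync_every == 0

def bandwidth_reduction_factor(sync_every=SYNC_EVERY):
    """Factor by which per-worker download traffic shrinks
    compared to syncing after every single game."""
    return sync_every
```

With `SYNC_EVERY = 60`, the ~26 TB/month estimate from earlier in the thread would drop by a factor of 60 on the download side, which is why the bandwidth numbers no longer look prohibitive.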
you are right, @grolich.