leela-chess
Added support for using multiple GPUs on the training server when training the model
Added code to train the model on more than one GPU
When using a single GPU the code is essentially unchanged: tf.Variable() is replaced with tf.get_variable() so the variables can be reused across GPU towers, and model loading is changed so it can also load models previously trained with multiple GPUs.
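(For reference, roughly the sharing pattern this relies on; the layer name, shapes and AUTO_REUSE usage below are illustrative, not the actual tfprocess.py code:)

```python
import tensorflow as tf

NUM_GPUS = 2  # illustrative

def conv_block(x, name):
    # tf.get_variable() looks the weight up in the enclosing variable scope,
    # creating it on the first call and reusing it on later calls.
    with tf.variable_scope(name):
        w = tf.get_variable("w", shape=[3, 3, 64, 64],
                            initializer=tf.glorot_uniform_initializer())
        return tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")

inputs = tf.placeholder(tf.float32, [None, 8, 8, 64])
input_splits = tf.split(inputs, NUM_GPUS, axis=0)

# One tower per GPU, all towers sharing the same weights.
tower_outputs = []
with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
    for i, x in enumerate(input_splits):
        with tf.device("/gpu:%d" % i):
            tower_outputs.append(conv_block(x, "conv1"))
```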
In my own tests, training with the same batch size takes around 65% of the time on 2 GPUs compared to 1 GPU. Doubling the batch size makes training take about 30% longer than a single GPU with the unchanged batch size.
Added a configurable parameter for which device collects and applies the gradients from all the GPUs. In my tests this was about 10% faster when done on the CPU rather than on one of the GPUs, but this may differ on other systems depending on the interconnects between the GPUs.
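(Roughly what the gradient collection looks like with a configurable device; the function and parameter names here are just for illustration, not the exact tfprocess.py interface, and None gradients are assumed absent:)

```python
import tensorflow as tf

def average_gradients(tower_grads):
    # tower_grads: one list of (gradient, variable) pairs per GPU tower.
    averaged = []
    for pairs in zip(*tower_grads):
        grads = [g for g, _ in pairs]
        var = pairs[0][1]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
    return averaged

def build_train_op(tower_losses, optimizer, gradients_device="/cpu:0"):
    # gradients_device is the configurable parameter discussed above,
    # e.g. "/cpu:0" or "/gpu:0".
    tower_grads = []
    for i, loss in enumerate(tower_losses):
        with tf.device("/gpu:%d" % i):
            tower_grads.append(optimizer.compute_gradients(loss))
    with tf.device(gradients_device):
        return optimizer.apply_gradients(average_gradients(tower_grads))
```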
Has this code undergone end-to-end testing? tfprocess.py has drastic changes and I am not sure how to check whether training is progressing better with multi-GPU support.
My server has 4 GPUs and this works, but I can't vouch for the model generated from it.
Ran this change for 24 hours and everything seems to be OK, including the model it built.
Although note that this seems incompatible with checkpoints generated by the single-GPU code (there is an error unless the .meta file is cleared).
Also, any reason you are applying the final steps on the CPU and not the GPU?
I tested training on the same set of games with the same random seed and got the same loss curves as the normal code. I have not tested extensively whether the trained nets will play identical games after x training runs.
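(For anyone repeating that comparison, this is roughly how a run can be pinned to a fixed seed in TF1-style code; the seed value is arbitrary, and some GPU kernels are still non-deterministic, so small differences can remain:)

```python
import random
import numpy as np
import tensorflow as tf

SEED = 12345  # arbitrary; use the same value for both code paths being compared
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed, set before building the graph
```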
You can apply the final steps on the GPU if you like. In my tests training was about 10% faster when performing them on the CPU, but this will vary depending on which CPU and GPUs you have and the interconnect speeds between the GPUs.
Checkpoint files generated by the main branch cannot be loaded by this branch (and vice versa) due to metadata differences (one stores device IDs and paths, the other does not).
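(If it is only the recorded device placements in the .meta file that get in the way, TF1's meta-graph import can strip them at load time; a sketch, with placeholder paths:)

```python
import tensorflow as tf

# clear_devices=True drops the device assignments stored in the .meta file,
# so a graph saved with explicit /gpu:N placements can be re-imported by a
# code path that places variables differently.
saver = tf.train.import_meta_graph("path/to/model.meta", clear_devices=True)
with tf.Session() as sess:
    saver.restore(sess, "path/to/model")  # checkpoint prefix (placeholder)
```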
The meta file incompatibility is going to be a deal breaker for many, possibly even for the lczero network. It effectively means we have to start training from scratch, unless you can modify net_to_model.py to create checkpoints compatible with the multi-GPU version from the trained model.
The meta file incompatibility is not an issue: you can convert an existing weights.txt file into a checkpoint using the net_to_model.py program, and it is already compatible with the modified tfprocess.py.
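(Not the actual net_to_model.py, but the general kind of conversion it performs, sketched with a made-up file layout and function name: read the flat weights, assign them to the graph's variables, and save a fresh checkpoint:)

```python
import numpy as np
import tensorflow as tf

def text_weights_to_checkpoint(weights_path, checkpoint_prefix):
    # Illustrative layout only: one whitespace-separated row of floats per
    # tensor, in the same order as the graph's trainable variables. The real
    # weights.txt format and variable ordering are handled by net_to_model.py.
    with open(weights_path) as f:
        rows = [np.array(line.split(), dtype=np.float32)
                for line in f if line.strip()]
    assign_ops = [tf.assign(var, row.reshape(var.shape.as_list()))
                  for var, row in zip(tf.trainable_variables(), rows)]
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(assign_ops)
        saver.save(sess, checkpoint_prefix)  # writes a new checkpoint + .meta
```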