
Added support for using multiple GPUs on training server when training the model

Open gyathaar opened this issue 6 years ago • 5 comments

Added code to train the model on more than one GPU

When using a single GPU the code is basically unchanged (tf.Variable() is replaced with tf.get_variable() so the variables can be reused across GPU towers, and model loading is changed so it can load models previously trained with multiple GPUs).

In my own tests, training with the same batch size takes around 65% as long with 2 GPUs as with 1 GPU. Doubling the batch size makes training take about 30% longer than a single GPU with the unchanged batch size.

Added a configurable parameter for which device collects and applies the gradients from all the GPUs. In my tests this was about 10% faster when done on the CPU compared to one of the GPUs, but this may differ on other systems depending on the interconnects between the GPUs.
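A minimal, self-contained sketch of the two ideas above (this is not the actual tfprocess.py code; the model, shapes, and names are purely illustrative): tf.get_variable() inside a reused variable scope lets every GPU tower share one set of weights, and the gradient averaging/apply step is pinned to a configurable device.

```python
import tensorflow as tf

num_gpus = 2
batch_size = 64
grad_device = "/cpu:0"   # the configurable parameter; "/gpu:0" would also work

x = tf.placeholder(tf.float32, [batch_size, 10])
y = tf.placeholder(tf.float32, [batch_size, 1])
x_split = tf.split(x, num_gpus, axis=0)
y_split = tf.split(y, num_gpus, axis=0)

optimizer = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)

tower_grads = []
for i in range(num_gpus):
    # reuse=(i > 0): the second tower looks up the variables created by the
    # first tower by name instead of allocating a duplicate set of weights.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        w = tf.get_variable("w", shape=[10, 1],
                            initializer=tf.zeros_initializer())
        loss = tf.reduce_mean(tf.square(tf.matmul(x_split[i], w) - y_split[i]))
        tower_grads.append(optimizer.compute_gradients(loss))

# Collect, average and apply the per-tower gradients on the chosen device.
with tf.device(grad_device):
    averaged = []
    for grads_and_vars in zip(*tower_grads):   # same variable across all towers
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0),
                         grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged)
```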

gyathaar avatar May 09 '18 11:05 gyathaar

Has this code undergone end-to-end testing? tfprocess.py has drastic changes, and I am not sure how I can check whether the training is progressing better with multi-GPU support.

My server has 4 GPUs and this works, but I can't vouch for the model generated from it.

ganeshkrishnan1 avatar May 21 '18 14:05 ganeshkrishnan1

Ran this change for 24 hours and everything seems to be ok including the model built.

Although note that this seems incompatible with checkpoints generated from a single-GPU run (there is an error unless the .meta file is cleared).

Also, any reason you are applying the final steps on the CPU and not the GPU?

ganeshkrishnan1 avatar May 22 '18 12:05 ganeshkrishnan1

I tested training on the same set of games with the same random seed and got the same loss curves as the normal code. I have not tested extensively whether the trained nets will play identical games after x training runs.

You can apply the final steps on the GPU if you like. In my tests training was about 10% faster when performing them on the CPU, but this will vary depending on what CPU and GPUs you have, and on the interconnect speeds between the GPUs.

Checkpoint files generated by the main branch cannot be loaded by this branch (and vice versa) due to metadata differences (one stores device IDs and paths, the other does not).
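For context, a general TensorFlow mechanism (not something this PR adds) for the device-placement part of that metadata: the .meta file records the graph including device placements, and importing it with clear_devices=True strips them. It does not fix differences in variable names or scopes, so it may not be sufficient on its own here. The paths below are illustrative.

```python
import tensorflow as tf

meta_path = "model.ckpt.meta"   # illustrative checkpoint paths
ckpt_path = "model.ckpt"

# clear_devices=True drops the stored device strings from the imported graph.
saver = tf.train.import_meta_graph(meta_path, clear_devices=True)
with tf.Session() as sess:
    saver.restore(sess, ckpt_path)
```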

gyathaar avatar May 22 '18 14:05 gyathaar

The meta-file incompatibility is going to be a deal breaker for many, possibly even for the lczero network. This effectively means we have to start training from scratch, unless you can modify net_to_model.py to create checkpoints compatible with the multi-GPU version from the trained model.

ganeshkrishnan1 avatar May 22 '18 14:05 ganeshkrishnan1

The meta-file incompatibility is not an issue: you can convert an existing weights.txt file into a checkpoint using the net_to_model.py program, which is already compatible with the modified version of tfprocess.py.
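For reference, the conversion is a single script run over the existing network file; the exact arguments should be checked against the script itself, but the assumed invocation is just the weights file:

```
# assumed usage -- verify against the script's argument handling
python net_to_model.py weights.txt
```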

gyathaar avatar May 23 '18 06:05 gyathaar