How to run `benchmark/torch/AlphaZero`?

Open rydeveraumn opened this issue 2 years ago • 8 comments

Hey all,

I am wondering how I can run the benchmark/torch/AlphaZero code. When I follow the instructions, the code does not run out of the box. I would like to train a model and submit it to Kaggle. I will add everything I am seeing so far when I try to run it.

rydeveraumn avatar Nov 23 '22 16:11 rydeveraumn

Hello. Have you run all the three steps in the instructions?

TomorrowIsAnOtherDay avatar Nov 23 '22 16:11 TomorrowIsAnOtherDay

Hey @TomorrowIsAnOtherDay, I sure have. There are a number of different errors that seem to be happening for me:

  • `Expected more than 1 value per channel when training, got input size torch.Size([1, 128])` - which is coming from the batch normalization in connect4_model.py. I removed batch normalization, and then another error comes up.
  • `ValueError: optimizer got an empty parameter list` - which seems to come from Connect4Model returning an empty list when you call model.parameters().

Haven't got farther than this. I was able to run examples/AlphaZero, but then gen_submission.py had errors on Kaggle. Not sure what the next steps should be.

Thanks for the reply!

BTW:

parl.__version__ = 2.0.5
paddlepaddle.__version__ = 2.3.2

Steps:

  • Clone the develop repository
  • `xparl start --port 8010 --cpu_num 5` (everything works fine)
  • `cd benchmark/torch/AlphaZero`
  • `python main.py`

rydeveraumn avatar Nov 23 '22 16:11 rydeveraumn

It seems that you have installed two deep learning frameworks (torch & paddle). PARL supports these two frameworks, but it imports paddle by default. To specify torch as the backend framework, try exporting the following environment variable:

xparl stop
export PARL_BACKEND=torch

and run the instruction again.
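
If you launch from a Python script rather than a shell, a rough equivalent is to set the variable before `import parl`. This is only a sketch: it assumes PARL reads `PARL_BACKEND` when it is imported, and `select_torch_backend` is a made-up helper name, not part of PARL.

```python
import os


def select_torch_backend(environ=os.environ):
    """Set PARL_BACKEND=torch in `environ` and return the previous value (or None).

    Hypothetical helper for illustration; the only name taken from this
    thread is the PARL_BACKEND environment variable.
    """
    previous = environ.get("PARL_BACKEND")
    environ["PARL_BACKEND"] = "torch"
    return previous


if __name__ == "__main__":
    select_torch_backend()
    # Assumption: the backend is chosen at import time, so import parl
    # only after the variable has been set.
    # import parl
    print("PARL_BACKEND =", os.environ["PARL_BACKEND"])
```

The xparl cluster runs as a separate process, so it presumably still needs the variable exported in its own shell before `xparl start`.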

TomorrowIsAnOtherDay avatar Nov 23 '22 17:11 TomorrowIsAnOtherDay

Okay let me give it a shot! Will report back soon

rydeveraumn avatar Nov 23 '22 17:11 rydeveraumn

Now I am getting a ton of `No vacant CPU resources at the moment` errors:

image

rydeveraumn avatar Nov 23 '22 18:11 rydeveraumn

@TomorrowIsAnOtherDay it does show that I have 5 vacant CPUs: image

rydeveraumn avatar Nov 23 '22 19:11 rydeveraumn

Okay I think I figured out what was happening:

When you set a different number of CPUs than the one suggested in the README, you also have to modify numActors in the main.py script, and depending on what you do there you also have to modify arenaCompare. If you want, I can add a small PR in case anyone is interested in using this.

rydeveraumn avatar Nov 23 '22 19:11 rydeveraumn

Exactly. The commands related to xparl only launch a CPU cluster, and they will not change the number of actors used for training. Users must modify the num_actors in main.py to change the number of actors.
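
To catch this mismatch before training starts, a small pre-flight check along these lines might help. It is a sketch under one assumption: each actor occupies one CPU of the cluster. `check_actor_budget` is a hypothetical helper, not something in PARL or in main.py, and it does not cover arenaCompare.

```python
def check_actor_budget(num_actors, cpu_num):
    """Return the number of CPUs left over after placing every actor.

    Assumes (not confirmed in this thread) that one actor uses one CPU.
    Raises ValueError when the cluster is too small for the actors.
    """
    if num_actors < 1:
        raise ValueError("num_actors must be at least 1")
    if num_actors > cpu_num:
        raise ValueError(
            f"{num_actors} actors requested but the cluster was started "
            f"with only {cpu_num} CPUs; lower the actor count in main.py "
            f"or restart xparl with a larger --cpu_num"
        )
    return cpu_num - num_actors


if __name__ == "__main__":
    # Cluster started with `xparl start --port 8010 --cpu_num 5`.
    print(check_actor_budget(num_actors=5, cpu_num=5))
```

Calling it with more actors than CPUs raises a `ValueError` up front, instead of the run stalling on repeated "no vacant CPU" messages.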

TomorrowIsAnOtherDay avatar Nov 24 '22 02:11 TomorrowIsAnOtherDay