
#338 Issue: Re-enabling Supervised Learning.

Open Zeta36 opened this issue 6 years ago • 11 comments

Re-enabling supervised learning. Also fixing some PGN parser bugs:

  1. Fixing a problem where some PGN files use the move format 1.d4 (no space after the dot) instead of 1. d4 (pgn.cpp lines 68 to 76).
  2. Skipping comments correctly. It's not enough to check for the opening character '{'; we have to make sure we find the closing '}' before continuing (pgn.cpp line 59).
  3. If 'cfg_supervise' is not empty we are generating supervised data, so we don't need to load and initialize a network from a file (main.cpp lines 369 and 373).

I had some conflicts with the current pgn.cpp file, but I think my changes are better than those currently in the 'next' branch:

  1. The current pgn.cpp does not take into account that some PGN files use the move format 1.d4 (no space after the dot) instead of 1. d4.

For example, imagine this: 1.e4 e6 2.d4 d5 3.Nd2 c5...

In the current file we have:

// Skip the move numbers
    if (s.back() == '.') {
      continue;
    }

but this fails because, as I said, some PGN files use a valid format where the first move comes right after the dot with no space: 1.d4 instead of 1. d4.

I fixed this by starting from the master version of pgn.cpp and making the changes needed to handle both formats (with and without a space after the dot).
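The idea can be sketched like this (a minimal standalone sketch, not the actual pgn.cpp patch): instead of discarding any token ending in '.', strip a leading move-number prefix from the token and keep whatever SAN move may be attached after the dot(s).

```cpp
#include <cctype>
#include <string>

// Sketch: strip a move-number prefix like "1." or "12..." from a PGN
// token. Handles both "1. d4" (token is just "1.", result is empty and
// the caller reads the next token) and "1.d4" (move attached after the
// dot, result is "d4"). Tokens without such a prefix pass through.
std::string StripMoveNumber(const std::string& token) {
  std::size_t i = 0;
  while (i < token.size() &&
         std::isdigit(static_cast<unsigned char>(token[i]))) {
    ++i;
  }
  // Only a move-number prefix if the digits are followed by at least one
  // dot ("1." for White, "1..." for Black); otherwise leave the token
  // alone so results like "1-0" pass through untouched.
  if (i == 0 || i == token.size() || token[i] != '.') return token;
  while (i < token.size() && token[i] == '.') ++i;
  return token.substr(i);
}
```

With this, the parser treats an empty result as "only a move number, read the next token" and otherwise uses the remainder as the move itself.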

  2. Also, the way the current file tries to skip comments is wrong, since some comments contain several words and we need to iterate until we find the closing character '}'; otherwise, after the continue; the next "is_ >> s;" reads a word from the middle of the comment. The correct way is to iterate as I did in line 59.

For example, imagine this: ...48. Rh1 Rxd7 49. Re1 {Black forfeits on time} 1-0

In the current file we have:

// Skip comments
    if (s.front() == '{' || s.front() == '[' || s.back() == '}' || s.back() == ']') {
      continue;
    }

But then, after the "continue;", the line:

is_ >> s;

would set s = "forfeits", which is wrong.

We need the do{...}while(...); loop I used, to reach the '}' before continuing.

Zeta36 avatar May 01 '18 07:05 Zeta36

I also fixed a bug in the PGN parser that occurs when two or more blank lines separate one game from the next. Currently, if this happens, the parser breaks with an error.

Zeta36 avatar May 01 '18 10:05 Zeta36

That's strange. Can you please share the lines of the first game of your games.pgn file?

Zeta36 avatar May 01 '18 17:05 Zeta36

@Zeta36 After re-build without src/pgn.cpp changes:

root@deep:~/leela-chess/build# ./lczero --supervise gamesh.pgn
Using 2 thread(s).
Found 6 existing chunks in supervise-gamesh/training
Processed 1355 games
Invalid game in gamesh.pgn
Writing chunk 6

[Event "?"]
[Site "?"]
[Date "2018.01.08"]
[Round "1"]
[White "Stockfish 080118 64 POPCNT"]
[Black "Stockfish 080118 64 POPCNT"]
[Result "1/2-1/2"]
[ECO "A54"]
[Opening "Old Indian"]
[Variation "Ukrainian Variation , 4.Nf3"]
[TimeControl "10+0.1"]
[PlyCount "108"]

1. d4 Nf6 2. c4 d6 3. Nc3 e5 4. Nf3 e4 5. Ng5 Bf5 6. g4 Bxg4 7. Bg2 Nbd7 8.
Ngxe4 Nxe4 9. Bxe4 c6 10. Rg1 Bh5 11. Qb3 Qb6 12. d5 c5 13. Bd3 Qxb3 14.
axb3 Ne5 15. Bf5 g6 16. Bh3 f5 17. Bf4 a6 18. Bxe5 dxe5 19. e4 f4 20. Kd2
Bd6 21. Kc2 Kf7 22. Bg4 Bxg4 23. Rxg4 Be7 24. f3 h5 25. Rg2 g5 26. d6 Bxd6
27. Rxg5 Rag8 28. Rf5+ Ke6 29. Rd1 Rg2+ 30. Kb1 Rhg8 31. Ka2 Rxh2 32. Nd5
Rg6 33. b4 cxb4 34. c5 Bb8 35. Rf8 Ba7 36. Re8+ Kd7 37. Re7+ Kc8 38. Rh7
Rgg2 39. Rh8+ Kd7 40. Nxb4+ Ke7 41. Rh7+ Kf6 42. Rd6+ Kg5 43. Nd3 Bxc5 44.
Nxc5 Rxb2+ 45. Ka3 Ra2+ 46. Kb4 Rhb2+ 47. Kc4 Rc2+ 48. Kb3 Rcb2+ 49. Kc4
Rc2+ 50. Kb3 Rab2+ 51. Ka3 Ra2+ 52. Kb4 Rcb2+ 53. Kc3 Rc2+ 54. Kb4 Rcb2+
1/2-1/2
root@deep:~/leela-chess/training/tf# ./parse.py configs/example.yaml 
dataset:
  input: /root/leela-chess/build/supervise-gamesh/training
  num_chunks: 6
  train_ratio: 0.9
gpu: 0
model:
  filters: 64
  residual_blocks: 6
name: kb1-64x6
training:
  batch_size: 2048
  lr_boundaries:
  - 100000
  - 130000
  lr_values:
  - 0.02
  - 0.002
  - 0.0005
  path: /root/leela-chess/networks
  policy_loss_weight: 1.0
  shuffle_size: 524288
  total_steps: 140000
  value_loss_weight: 1.0

Not enough chunks

parse.py

    if len(chunks) < num_chunks:
        print("Not enough chunks")
        sys.exit(1)

hsntgm avatar May 01 '18 18:05 hsntgm

You can run your PGN through pgn-extractor, which writes the file out in the desired format. The main issue with supervised learning is the deliberately thrown exception.

If I comment out that line, training works. Why did someone throw an exception on that line and disable supervised learning?

ganeshkrishnan1 avatar May 04 '18 00:05 ganeshkrishnan1

@ganeshkrishnan1 I added it because I wasn't sure if it would still work, and I didn't have time to test it while I was making other, more important changes that I thought would break the SL process. So you tried it and it works? I guess you had to specify a network weights file; I think this PR removes that requirement. I think it will also output training format v1, not v2. That's a more serious problem; it looks like this PR fixes that too.

@Zeta36 I think it's probably not needed to do Network::set_format_version(2) since Network.cpp defaults to that.

Edit: Also can you remerge with latest upstream/next branch? There are some conflicts.

killerducky avatar May 04 '18 22:05 killerducky

thanks @killerducky
The training did output files after I removed the exception, but I couldn't run the trainer on the data since the parse.py file no longer exists and chunkparser.py throws an error.

I will wait for this merge and try one more time

ganeshkrishnan1 avatar May 05 '18 00:05 ganeshkrishnan1

@ganeshkrishnan1 I asked the community the same question rather bluntly: "where are parse.py and the other files?". @Error323 told me parse.py, leela_to_proto.py, and supervised_parse.py were combined into train.py.

There is also start.sh, which calls train.py and uploads games to the network automatically. But there is something tricky here that I'm missing; I still can't manage the self-training part.

I tested it on my low-end hardware with tensorflow-gpu 1.7, but the process is endless.

It seems it can't parse the .gz file. My data is only 5 MB. Nothing changes, supervised or self-play; I can't get new weights.

sorting 1 chunks...[done]
training.0.gz - training.0.gz
Using 4 worker processes.
Using 4 worker processes.
2018-05-05 18:05:47.938388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: GeForce GTX 650 Ti BOOST major: 3 minor: 0 memoryClockRate(GHz): 1.0845 pciBusID: 0000:01:00.0 totalMemory: 1.95GiB freeMemory: 1400.94MiB
2018-05-05 18:05:47.938551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-05 18:05:48.354191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-05 18:05:48.370575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-05-05 18:05:48.370627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-05-05 18:05:48.370767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1796 MB memory) -> physical GPU (device: 0, name: GeForce GTX 650 Ti BOOST, pci bus id: 0000:01:00.0, compute capability: 3.0)
Using 1 evaluation batches

Maybe the yaml config requires a very large hardware footprint. I lowered all the training definitions, but the process still runs on all CPU cores and evaluation doesn't start.

Also in tfprocess.py there are tricky options.

        # You need to change the learning rate here if you are training
        # from a self-play training set, for example start with 0.005 instead.
        opt_op = tf.train.MomentumOptimizer(
            learning_rate=self.learning_rate, momentum=0.9, use_nesterov=True)

        # For training from a (smaller) dataset of strong players, you will
        # want to reduce the factor in front of self.mse_loss here.
        pol_loss_w = self.cfg['training']['policy_loss_weight']
        val_loss_w = self.cfg['training']['value_loss_weight']
        loss = pol_loss_w * self.policy_loss + val_loss_w * self.mse_loss + self.reg_term

hsntgm avatar May 05 '18 15:05 hsntgm

@hsntgm You are correct, it's now train.py. The command is something like this:

./train.py --cfg=configs/aihello.yaml --output=/data/aihello.com/training/output

ganeshkrishnan1 avatar May 06 '18 00:05 ganeshkrishnan1

ping @Zeta36 can you please merge with next again? There are some conflicts.

killerducky avatar May 08 '18 01:05 killerducky

The pipeline assumes 1 game per file, so main.cpp needs to be fixed.

auto chunker = OutputChunker{dir.string() + "/training", true, 15000};

Should be changed to

#define GAMES_PER_FILE 1
auto chunker = OutputChunker{dir.string() + "/training", true, GAMES_PER_FILE};

dkappe avatar May 08 '18 02:05 dkappe

This still has issues parsing PGN files, especially if there are multiple games in the PGN. Can we not use a standard PGN parser library, similar to python uci-chess?

ganeshkrishnan1 avatar May 08 '18 04:05 ganeshkrishnan1