
Updated ELF still returning exceeded memory error

Open downseq opened this issue 6 years ago • 8 comments

Using two-gtp: https://www.mankier.com/1/gogui-twogtp

System spec: 150 GB RAM, NVIDIA Tesla V100 GPU

The error occurs after 2.5 games, using the following settings:

```
./gtp.sh ~/v1.bin --gpu 0 --num_block 20 --dim 224 --mcts_puct 1.5 --batchsize 2 --mcts_rollout_per_batch 2 --mcts_threads 2 --mcts_rollout_per_thread 250 --resign_thres 0.00 --mcts_virtual_loss 1
```

From log:

```
[2018-10-17 17:46:47.497] [elf::ai::tree_search::MCTSAI_T-22] [info] [-1] MCTSAI Result: BestA: [B9][bi][191], MaxScore: 3, Info: -2.97157/3 (-0.990524), Pr: 0.0101511, child node: 21109 Action: 191 MCTS: 1239.9ms. Total: 1239.9ms.
B<< B<< = B9 B<< B<< W>> play B B9 W<< W<< = W<< W<< W>> genmove w
slurmstepd: error: Job [omitted] exceeded memory limit (153936196 > 153600000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB [omitted] ON [omitted] CANCELLED AT 2018-10-18T04:46:48
```
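For scale (slurmstepd appears to report these figures in KiB), the job was only a few hundred MiB over its limit when it was killed:

```python
# Numbers from the slurm error above; units assumed to be KiB.
used_kib = 153_936_196
limit_kib = 153_600_000

over_mib = (used_kib - limit_kib) / 1024
print(f"over the limit by ~{over_mib:.0f} MiB")  # ~328 MiB
```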

Could this be an error on the two-gtp side? I'm not sure how to run games without two-gtp, using just the ELF system for competitive play (not self-play training).

#100 #94

downseq avatar Oct 17 '18 18:10 downseq

Using the command you provided, I observe that memory usage reaches an asymptote of roughly 2.5 GB.

Not sure whether twogtp would cause any issues. cc @qucheng

jma127 avatar Oct 19 '18 17:10 jma127

I found a workaround: running games individually, so that memory usage resets between games. If I get some spare time, I might confirm whether it was two-gtp related.

For those who are interested, this is the command I ran to work around it (you can adjust the number of games etc. depending on how much memory is being used up):

```
#!/bin/bash
BLACK="player_b.sh"
WHITE="player_w.sh"
for i in {1..50}
do
  ./gogui-twogtp -black "$BLACK" -white "$WHITE" -games 1 \
    -size 19 -sgffile game_filename_$i -auto -verbose -debugtocomment -komi 7.5
done
```
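The same loop can also be driven from Python. A sketch (the binary and player-script names are the ones from this thread; adjust paths to your setup):

```python
# Run gogui-twogtp one game at a time so each process exits and
# releases its memory between games, like the bash loop above.
import subprocess

def twogtp_cmd(game_index, black="player_b.sh", white="player_w.sh", komi=7.5):
    """Build the gogui-twogtp command line for a single game."""
    return [
        "./gogui-twogtp",
        "-black", black,
        "-white", white,
        "-games", "1",
        "-size", "19",
        "-sgffile", f"game_filename_{game_index}",
        "-auto", "-verbose", "-debugtocomment",
        "-komi", str(komi),
    ]

# To run all 50 games (requires gogui-twogtp in the working directory):
# for i in range(1, 51):
#     subprocess.run(twogtp_cmd(i), check=True)
```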

downseq avatar Oct 20 '18 11:10 downseq

twogtp just runs two copies of ELF, which together consume ~5 GB. That might exceed the memory limit on some hardware.

qucheng avatar Oct 23 '18 18:10 qucheng

For experimental reasons, could you point me to the code that removes the entire subtree the AI creates after each move (not just the unused portion)?

downseq avatar Oct 24 '18 02:10 downseq

@downseq Removing --persistent_tree would clean up the existing tree before each move. See here: https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/mcts.h#L142

yuandong-tian avatar Oct 24 '18 21:10 yuandong-tian

> @downseq remove --persistent_tree would clean up the existing tree before each move. See here: https://github.com/pytorch/ELF/blob/master/src_cpp/elf/ai/tree_search/mcts.h#L142

Thanks for the confirmation.

So it seems keeping the subtree after each move is not enabled by default, unlike in the AlphaGo Zero paper?

downseq avatar Oct 25 '18 02:10 downseq

@downseq It is always helpful. If memory allows, keeping the subtree boosts performance at zero additional cost, so why not?
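To illustrate what tree reuse means here, a minimal sketch of the general MCTS idea (not ELF's actual implementation): after a move is committed, the chosen child's subtree becomes the new root and its rollout statistics carry over, while the sibling subtrees are freed.

```python
# Minimal sketch of MCTS subtree reuse between moves.
class Node:
    def __init__(self, move=None):
        self.move = move
        self.visits = 0
        self.children = {}  # move -> Node

def advance_root(root, move):
    """Re-root the tree at the child reached by `move`.

    With reuse, the child's visit counts (and its whole subtree) survive
    into the next search; without reuse, every turn starts from an empty
    node and all prior rollouts are recomputed.
    """
    new_root = root.children.pop(move, None) or Node(move)
    root.children.clear()  # drop the siblings so they can be freed
    return new_root
```

Disabling persistence corresponds to starting from a fresh `Node()` every turn: lower memory, but search effort is thrown away.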

yuandong-tian avatar Oct 26 '18 14:10 yuandong-tian

I think there is still some confusion about whether it is on by default, but it seems it is left on by default, so maybe I misinterpreted your earlier comment.

downseq avatar Oct 26 '18 18:10 downseq