
Please explain the parameters

Open xphoniex opened this issue 6 years ago • 9 comments

I'm trying to train for a 9x9 board using just a few self-play games, to see the whole process (self-play -> training the NN on the self-play results -> self-play with the new NN) on my own machine. Can someone please explain what these parameters do?

  • q_min_size
  • q_max_size
  • num_reader

edit: apparently num_reader is the number of queues shared with the trainer, and q_min_size/q_max_size are the min/max number of games needed for training to start.

What is the difference between:

  • eval_num_games vs. num_games
  • num_minibatch vs. batchsize

And why are some parameters repeated in start_server.sh, like eval_num_games?

start_client.sh

root=./myserver game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 \
stdbuf -o 0 -e 0 python3 ./selfplay.py \
    --T 1    --batchsize 2 \
    --dim0 224    --dim1 224    --gpu 0 --gpu0 0 \
    --keys_in_reply V rv    --mcts_alpha 0.03 \
    --mcts_epsilon 0.25    --mcts_persistent_tree \
    --mcts_puct 0.85    --mcts_rollout_per_thread 10 \
    --mcts_threads 4    --mcts_use_prior \
    --mcts_virtual_loss 5   --mode selfplay \
    --num_block0 20    --num_block1 20 \
    --num_games 1    --ply_pass_enabled 160 \
    --policy_distri_cutoff 30    --policy_distri_training_for_all \
    --port 1234 \
    --no_check_loaded_options0    --no_check_loaded_options1 \
    --replace_prefix0 resnet.module,resnet init_conv.module,init_conv\
    --replace_prefix1 resnet.module,resnet init_conv.module,init_conv\
    --resign_thres 0.0    --selfplay_timeout_usec 10 \
    --server_id myserver    --use_mcts \
    --use_fp160 --use_fp161 \
    --use_mcts_ai2 --verbose

start_server.sh

save=./myserver game=elfgames.go.game model=df_kl model_file=elfgames.go.df_model3 \
    stdbuf -o 0 -e 0 python3.6 -u ./train.py \
    --mode train --num_reader 2    --batchsize 1 \
    --num_games 1    --keys_in_reply V \
    --T 1    --use_data_parallel \
    --num_minibatch 1    --num_episode 10 \
    --mcts_threads 4    --mcts_rollout_per_thread 20 \
    --keep_prev_selfplay    --keep_prev_selfplay \
    --use_mcts     --use_mcts_ai2 \
    --mcts_persistent_tree    --mcts_use_prior \
    --mcts_virtual_loss 5     --mcts_epsilon 0.25 \
    --mcts_alpha 0.03     --mcts_puct 0.85 \
    --resign_thres 0.01    --gpu 0 \
    --server_id myserver     --eval_num_games 1 \
    --eval_winrate_thres 0.55     --port 1234 \
    --q_min_size 1     --q_max_size 2 \
    --save_first     \
    --num_block 20     --dim 224 \
    --weight_decay 0.0002    --opt_method sgd \
    --bn_momentum=0 --num_cooldown=50 \
    --expected_num_client 496 \
    --selfplay_init_num 0 --selfplay_update_num 0 \
    --eval_num_games 0 --selfplay_async \
    --lr 0.01    --momentum 0.9     1>> log.log 2>&1 &

My client simply doesn't stop self-playing, and the server shows this error:

[2018-09-10 20:11:03.791] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:03 2018, Reader: no message, Stats: 14/0/0, wait for 10 sec ... 
[2018-09-10 20:11:13.791] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:13 2018, Reader: no message, Stats: 14/0/0, wait for 10 sec ... 
[2018-09-10 20:11:23.861] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:23 2018, Reader: no message, Stats: 15/0/0, wait for 10 sec ... 
[2018-09-10 20:11:33.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:33 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:11:43.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:43 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:11:53.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:11:53 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:12:03.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:12:03 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:12:13.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:12:13 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:12:23.865] [elf::distributed::Reader-21] [info] Mon Sep 10 20:12:23 2018, Reader: no message, Stats: 16/0/0, wait for 10 sec ... 
[2018-09-10 20:12:26.748] [elf::base::SharedMem-87] [info] Error: active_batch_size = 0, max_batch_size: 1, min_batch_size: 1, #msg count: 0
python3.6: /home/user/Downloads/ELF/src_cpp/elf/base/sharedmem.h:156: void elf::SharedMem::waitBatchFillMem(elf::Server*): Assertion `false' failed.

xphoniex avatar Sep 10 '18 16:09 xphoniex

Can you please help with the sharedmem error? @jma127

xphoniex avatar Sep 14 '18 17:09 xphoniex

Ok. I will take a look at this issue. Did you change BOARDSIZE to be 9?

q_min_size is the minimal size of the replay buffer before training starts. q_max_size is the maximal size of the replay buffer (old experience will be discarded). Note that both numbers are multiplied by 50 (num_reader), since there are 50 buffers for the sake of concurrency.
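
As a quick illustration of that arithmetic, here is a small sketch (my own, not ELF code) that just applies the multiply-by-num_reader rule described above:

    # Hypothetical helper, not part of ELF: the effective replay-buffer
    # thresholds if each of `num_reader` queues holds q_min_size..q_max_size games.
    def effective_buffer_sizes(q_min_size, q_max_size, num_reader):
        return q_min_size * num_reader, q_max_size * num_reader

    # With q_min_size=1 / q_max_size=2 and the default 50 reader queues:
    print(effective_buffer_sizes(1, 2, 50))  # -> (50, 100)
    # With the values used later in this thread (--num_reader 2):
    print(effective_buffer_sizes(1, 2, 2))   # -> (2, 4)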

yuandong-tian avatar Sep 14 '18 20:09 yuandong-tian

@yuandong-tian I inserted #define BOARD9x9 1 at line 28 in src_cpp/elfgames/go/base/board.h, and self-play now shows 9x9 boards. I also tried with 19x19 and I'm still getting the sharedmem error.

I have changed the queue parameters on the server:

--q_min_size 1     --q_max_size 2    --num_reader 2

This should make the server start training after 4 games, right?

On the client I have added:

--suicide_after_n_games 16

which makes the client quit after 16 games. Occasionally I don't even hit the sharedmem error on the server side and I have to run the client one more time. Why is this?

xphoniex avatar Sep 15 '18 04:09 xphoniex

As explained in the previous comment, this will set the replay buffer size to min=50 and max=100.

qucheng avatar Sep 17 '18 17:09 qucheng

@qucheng Are you saying min/max is hard-coded somewhere in the code? Because:

  • A) I've changed q_min_size/q_max_size/num_reader
  • B) Why, then, am I hitting the sharedmem error after 6-8 games (usually)?

Regardless, my issue is not the minimum number of games required; it's the fact that my server hits the sharedmem error and won't train.

xphoniex avatar Sep 17 '18 18:09 xphoniex

Can you try setting batchsize to 1? You might also try increasing the number of games a little. Batchsize needs to be much smaller than the number of games due to concurrency (ELF gathers different self-play games as they become available).
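
As a toy sketch of that constraint (my own illustration, not the actual ELF batcher): with only a few games in flight, a large batchsize can simply never be satisfied.

    # Toy illustration (not ELF code): the batcher only fires once `batchsize`
    # game slots have data ready at the same moment, so batchsize > num_games
    # can never fill a batch.
    import random

    def try_fill_batch(num_games, batchsize):
        ready = [random.random() < 0.5 for _ in range(num_games)]  # games with data right now
        active = sum(ready)
        return active if active >= batchsize else 0  # 0 == keep waiting (active_batch_size = 0)

    print(try_fill_batch(num_games=16, batchsize=2))   # usually fills
    print(try_fill_batch(num_games=1, batchsize=64))   # can never fill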

qucheng avatar Sep 17 '18 19:09 qucheng

I set batchsize to 1 and q_min to 5 (on both client and server), which is what I assume you meant by increasing the number of games (as opposed to num_games, which increases the number of threads).

Same problem. log.log:

Python version: 3.6.5 (default, May  3 2018, 10:08:28) 
[GCC 5.4.0 20160609]
PyTorch version: 0.4.1
CUDA version 9.2.148
Conda env: 
[2018-09-18 00:56:24.983] [rlpytorch.model_loader.load_env] [info] Loading env
<module 'elfgames.go.game' from '/home/user/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/user/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2018-09-18 00:56:25.088] [rlpytorch.model_loader.load_env] [info] Parsed options: {'T': 1,
 'actor_only': False,
 'adam_eps': 0.001,
 'additional_labels': [],
 'backprop': True,
 'batchsize': 1,
 'batchsize2': -1,
 'black_use_policy_network_only': False,
 'bn': True,
 'bn_eps': 1e-05,
 'bn_momentum': 0.0,
 'cheat_eval_new_model_wins_half': False,
 'cheat_selfplay_random_result': False,
 'check_loaded_options': True,
 'client_max_delay_sec': 1200,
 'comment': '',
 'data_aug': -1,
 'dim': 224,
 'dist_rank': -1,
 'dist_url': '',
 'dist_world_size': -1,
 'dump_record_prefix': '',
 'epsilon': 0.0,
 'eval_model_pair': '',
 'eval_num_games': 0,
 'eval_old_model': -1,
 'eval_winrate_thres': 0.55,
 'expected_num_clients': 1,
 'following_pass': False,
 'freq_update': 1,
 'gpu': 0,
 'keep_prev_selfplay': True,
 'keys_in_reply': ['V'],
 'latest_symlink': 'latest',
 'leaky_relu': False,
 'list_files': [],
 'load': '',
 'load_model_sleep_interval': 0.0,
 'loglevel': 'info',
 'lr': 0.01,
 'mcts_alpha': 0.03,
 'mcts_epsilon': 0.25,
 'mcts_persistent_tree': True,
 'mcts_pick_method': 'most_visited',
 'mcts_puct': 0.85,
 'mcts_rollout_per_batch': 1,
 'mcts_rollout_per_thread': 1,
 'mcts_root_unexplored_q_zero': False,
 'mcts_threads': 4,
 'mcts_unexplored_q_zero': False,
 'mcts_use_prior': True,
 'mcts_verbose': False,
 'mcts_verbose_time': False,
 'mcts_virtual_loss': 5,
 'mode': 'train',
 'momentum': 0.9,
 'move_cutoff': -1,
 'num_block': 20,
 'num_cooldown': 50,
 'num_episode': 10,
 'num_future_actions': 1,
 'num_games': 1,
 'num_games_per_thread': -1,
 'num_minibatch': 2,
 'num_reader': 2,
 'num_reset_ranking': 5000,
 'omit_keys': [],
 'onload': [],
 'opt_method': 'sgd',
 'parameter_print': True,
 'parsed_args': ['./train.py',
                 '--mode',
                 'train',
                 '--num_reader',
                 '2',
                 '--batchsize',
                 '1',
                 '--num_games',
                 '1',
                 '--keys_in_reply',
                 'V',
                 '--T',
                 '1',
                 '--use_data_parallel',
                 '--num_minibatch',
                 '2',
                 '--num_episode',
                 '10',
                 '--mcts_threads',
                 '4',
                 '--mcts_rollout_per_thread',
                 '1',
                 '--keep_prev_selfplay',
                 '--keep_prev_selfplay',
                 '--use_mcts',
                 '--use_mcts_ai2',
                 '--mcts_persistent_tree',
                 '--mcts_use_prior',
                 '--mcts_virtual_loss',
                 '5',
                 '--mcts_epsilon',
                 '0.25',
                 '--mcts_alpha',
                 '0.03',
                 '--mcts_puct',
                 '0.85',
                 '--resign_thres',
                 '0.01',
                 '--gpu',
                 '0',
                 '--server_id',
                 'myserver',
                 '--eval_num_games',
                 '1',
                 '--eval_winrate_thres',
                 '0.55',
                 '--port',
                 '1234',
                 '--q_min_size',
                 '5',
                 '--q_max_size',
                 '20',
                 '--save_first',
                 '--num_block',
                 '20',
                 '--dim',
                 '224',
                 '--weight_decay',
                 '0.0002',
                 '--opt_method',
                 'sgd',
                 '--bn_momentum=0',
                 '--num_cooldown=50',
                 '--expected_num_client',
                 '1',
                 '--selfplay_init_num',
                 '0',
                 '--selfplay_update_num',
                 '0',
                 '--eval_num_games',
                 '0',
                 '--selfplay_async',
                 '--lr',
                 '0.01',
                 '--momentum',
                 '0.9'],
 'ply_pass_enabled': 0,
 'policy_distri_cutoff': 0,
 'policy_distri_training_for_all': False,
 'port': 1234,
 'preload_sgf': '',
 'preload_sgf_move_to': -1,
 'print_result': False,
 'q_max_size': 20,
 'q_min_size': 5,
 'ratio_pre_moves': 0,
 'record_dir': './record',
 'replace_prefix': [],
 'resign_thres': 0.01,
 'sample_nodes': ['pi,a'],
 'sample_policy': 'epsilon-greedy',
 'save_dir': './myserver',
 'save_first': True,
 'save_prefix': 'save',
 'selfplay_async': True,
 'selfplay_init_num': 0,
 'selfplay_timeout_usec': 0,
 'selfplay_update_num': 0,
 'server_addr': '',
 'server_id': 'myserver',
 'start_ratio_pre_moves': 0.5,
 'store_greedy': False,
 'suicide_after_n_games': -1,
 'tqdm': False,
 'trainer_stats': '',
 'use_data_parallel': True,
 'use_data_parallel_distributed': False,
 'use_df_feature': False,
 'use_fp16': False,
 'use_mcts': True,
 'use_mcts_ai2': True,
 'verbose': False,
 'weight_decay': 0.0002,
 'white_mcts_rollout_per_batch': -1,
 'white_mcts_rollout_per_thread': -1,
 'white_puct': -1.0,
 'white_use_policy_network_only': False}
Stats: Name  is not known!
[2018-09-18 00:56:25.091] [rlpytorch.model_loader.load_env] [info] Finished loading env
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] JobId: local
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] #Game: 1
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] T: 1
[2018-09-18 00:56:25.091] [elf::legacy::ContextOptions-0] [info] [#th=4][rl=1][per=1][eps=0.25][alpha=0.03][prior=1][c_puct=0.85][uqz=0][r_uqz=0]
[2018-09-18 00:56:25.091] [elfgames::go::train::TrainCtrl-11] [info] Finished initializing replay_buffer #Queue: 2, spec: ReaderQueue: Queue [min=5][max=20], Length: 0, 0, Total: 0, MinSizeSatisfied: 0
[2018-09-18 00:56:25.111] [elfgames::go::train::DataOnlineLoader-17] [info] ZMQVer: 4.2.3 Reader[db=data-1537215985.db] [local] Connect to [::1]:1234, ipv6: True, verbose: False
[2018-09-18 00:56:25.111] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:25 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
**** Options ****
Seed: 0
Time signature: 180918-005625
Client max delay in sec: 1200
#FutureActions: 1
#GamePerThread: -1
mode: train
Selfplay init min #games: 0, update #games: 0, async: True
UseMCTS: True
Data Aug: -1
Start_ratio_pre_moves: 0.5
ratio_pre_moves: 0
MoveCutOff: -1
Use DF feature: False
PolicyDistriCutOff: 0
Expected #client: 1
Server_addr: [::1], server_id: myserver, port: 1234
#Reader: 2, Qmin_sz: 5, Qmax_sz: 20
Verbose: False
Policy distri training for all moves: False
Min Ply from which pass is enabled: 0
Reset move ranking after 5000 actions
Resign Threshold: 0.01, Dynamic Resign Threshold, resign_prob_never: 0.1, target_fp_rate: 0.05, bounded within [1e-09, 0.5]

Komi: 3.5

*****************
Version:  a39e7dcdd12208a2e068d80f352948407176b219_unstaged
Mode:  train
Num Actions:  82
train: {'input': ['s', 'offline_a', 'winner', 'mcts_scores', 'move_idx', 'selfplay_ver'], 'reply': None}
SharedMem: "train", keys: ['s', 'offline_a', 'move_idx', 'winner', 'mcts_scores', 'selfplay_ver']
s float [1, 18, 9, 9]
offline_a int64_t [1, 1]
move_idx int32_t [1]
winner float [1]
mcts_scores float [1, 82]
selfplay_ver int64_t [1]
s float [1, 18, 9, 9]
offline_a int64_t [1, 1]
move_idx int32_t [1]
winner float [1]
mcts_scores float [1, 82]
selfplay_ver int64_t [1]
train_ctrl: {'input': ['selfplay_ver'], 'reply': None, 'batchsize': 1}
SharedMem: "train_ctrl", keys: ['selfplay_ver']
selfplay_ver int64_t [1]
selfplay_ver int64_t [1]
[2018-09-18 00:56:27.860] [elfgames::go::train::ThreadedCtrl-13] [info] Setting init version: 0
[2018-09-18 00:56:27.860] [elfgames::go::train::EvalSubCtrl-15] [info] Set new baseline model, ver: 0
[2018-09-18 00:56:27.860] [elfgames::go::train::SelfPlaySubCtrl-14] [info] SelfPlay: -1 -> 0
Root: "./myserver"
Keep prev_selfplay: True
Save first: 
Save to ./myserver
Filename = ./myserver/save-0.bin
About to wait for sufficient selfplay
[2018-09-18 00:56:28.018] [elfgames::go::train::ThreadedCtrl-13] [info] Tue Sep 18 00:56:28 2018, Sufficient sample for model 0
[2018-09-18 00:56:35.111] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:35 2018 Ctrl from local-user-A-e738-d962-2541-f1b9[1]: 1537215985
[2018-09-18 00:56:35.112] [elfgames::go::train::TrainCtrl-11] [info] New allocated: local-user-A-e738-d962-2541-f1b9, Clients[1][#max_eval=-1][#max_th=1][#client_delay=1200], SelfplayOnly[1/100%], EvalThenSelfplay[0/0%]
[2018-09-18 00:56:35.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:35 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:56:45.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:45 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:56:55.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:56:55 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:57:05.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:05 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:57:15.112] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:15 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:57:25.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:25 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:57:35.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:35 2018, Reader: no message, Stats: 0/0/0, wait for 10 sec ... 
[2018-09-18 00:57:45.113] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:45 2018, Reader: no message, Stats: 1/0/0, wait for 10 sec ... 
[2018-09-18 00:57:55.177] [elf::distributed::Reader-21] [info] Tue Sep 18 00:57:55 2018, Reader: no message, Stats: 2/0/0, wait for 10 sec ... 
[2018-09-18 00:58:05.200] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:05 2018, Reader: no message, Stats: 3/0/0, wait for 10 sec ... 
[2018-09-18 00:58:15.200] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:15 2018, Reader: no message, Stats: 3/0/0, wait for 10 sec ... 
[2018-09-18 00:58:25.218] [elf::distributed::Reader-21] [info] Tue Sep 18 00:58:25 2018, Reader: no message, Stats: 4/0/0, wait for 10 sec ... 
[2018-09-18 00:58:28.028] [elf::base::SharedMem-87] [info] Error: active_batch_size = 0, max_batch_size: 1, min_batch_size: 1, #msg count: 0
python3: /home/user/ELF/src_cpp/elf/base/sharedmem.h:156: void elf::SharedMem::waitBatchFillMem(elf::Server*): Assertion `false' failed.

xphoniex avatar Sep 17 '18 20:09 xphoniex

The sharedmem error is caused by /ELF/src_cpp/elf/comm/broadcast.h, line 122:

      // If adding this message would push the batch past opt.batchsize,
      // put it back and stop collecting for this batch.
      if ((int)(message.data.size() + data_count) > opt.batchsize) {
        unpop_msg(message);
        break;
      }

Data is being sent in batches of 64 (I still don't know where 64 comes from), which is bigger than our opt.batchsize of 1. Setting the batchsize on the server to 64 solved the issue. I don't think it should break here when data_count == 0.
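
To make that suggestion concrete, here is a rough Python restatement of the check (mine, not the actual C++), with the extra data_count > 0 guard I have in mind:

    # Rough restatement of the broadcast.h check above, plus the proposed guard:
    # only push the message back if the batch already contains some data.
    def should_unpop(message_size, data_count, opt_batchsize):
        exceeds = (message_size + data_count) > opt_batchsize
        return exceeds and data_count > 0

    print(should_unpop(message_size=64, data_count=0, opt_batchsize=1))    # False -> accept anyway
    print(should_unpop(message_size=1, data_count=64, opt_batchsize=64))   # True  -> wait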

I ran into another issue: the client fails when replacing its model with the new one:

[2018-09-20 20:02:57.430] [elfgames::go::common::DispatcherCallback-12] [info] Thu Sep 20 20:02:57 2018 Received actionable request: black_ver = 1, white_ver = -1, #addrs_to_reply: 1
In game start
[2018-09-20 20:02:57.875] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Loading model from ./myserver/save-1.bin
[2018-09-20 20:02:57.875] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] replace_prefix for state dict: [['resnet.module', 'resnet'], ['init_conv.module', 'init_conv']]
[2018-09-20 20:02:58.023] [rlpytorch.model_loader.ModelLoader-0-model_index0] [info] Finished loading model from ./myserver/save-1.bin
/pytorch/aten/src/THC/THCTensorRandom.cuh:185: void sampleMultinomialOnce(long *, long, int, T *, T *, int, int) [with T = __half, AccT = float]: block: [0,0,0], thread: [64,0,0] Assertion `THCNumerics<T>::ge(val, zero)` failed.
[... the same assertion is repeated for threads 0-81 ...]
THCudaCheck FAIL file=/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "./selfplay.py", line 202, in <module>
    main()
  File "./selfplay.py", line 196, in main
    GC.run()
  File "/home/user/ELF/src_py/elf/utils_elf.py", line 435, in run
    self._call(smem, *args, **kwargs)
  File "/home/user/ELF/src_py/elf/utils_elf.py", line 398, in _call
    reply = self._cb[idx](picked, *args, **kwargs)
  File "./selfplay.py", line 131, in <lambda>
    lambda batch, e=e, stat=stat: actor(batch, e, stat))
  File "./selfplay.py", line 126, in actor
    reply = e.actor(batch)
  File "/home/user/ELF/src_py/rlpytorch/trainer/trainer.py", line 101, in actor
    reply_msg = self.sampler.sample(state_curr)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sampler.py", line 56, in sample
    actions[a_node] = sampler(state_curr, self.options, node=pi_node)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 125, in sample_multinomial
    return sample_eps_with_check(probs, args.epsilon, greedy=greedy)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 74, in sample_eps_with_check
    actions = sample_with_check(probs, greedy=greedy)
  File "/home/user/ELF/src_py/rlpytorch/sampler/sample_methods.py", line 43, in sample_with_check
    cond1 = (actions < 0).sum()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generated/../THCReduceAll.cuh:317
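
For reference, the underlying kernel assertion (THCNumerics<T>::ge(val, zero) with T = __half) fires when torch.multinomial sees a negative or NaN probability, which fp16 inference (--use_fp160/--use_fp161 on the client) can produce. A defensive sketch of the kind of workaround I'm considering (a hypothetical helper, not a patch to sample_methods.py):

    import torch

    # Hypothetical helper: sanitize fp16 probabilities before multinomial sampling.
    def safe_multinomial(probs, num_samples=1):
        p = probs.float()          # leave fp16 before sampling
        p[p != p] = 0              # zero out NaNs
        p.clamp_(min=0)            # zero out negative values
        row_sums = p.sum(dim=-1, keepdim=True)
        p = torch.where(row_sums > 0, p, torch.ones_like(p))  # uniform fallback if all mass gone
        return torch.multinomial(p, num_samples)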

xphoniex avatar Sep 20 '18 15:09 xphoniex

A polite reminder that this issue is still open. Please read my previous comment. @qucheng @yuandong-tian

xphoniex avatar Sep 25 '18 15:09 xphoniex