ELF
Failed to load the pretrained v2 model when running the Go bot
Hi guys,
I followed the project homepage instructions exactly (all software versions match) and tried to run the Go bot with the pretrained v2 model, but it failed with this message: "RuntimeError: Error(s) in loading state_dict for Model_PolicyValue: Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var". Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked"."
The box is a 24 core x86-64 with a Nvidia GPU V100 / 16GB.
The full log is below. Thanks a lot!
(base) roobot@ELF:~/play-ELF/ELF/scripts/elfgames/go$ ./run.sh /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
Python version: 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0]
PyTorch version: 1.0.1.post2
CUDA version 10.0.130
Conda env: base
[2019-02-16 22:29:30.383] [rlpytorch.model_loader.load_env0] [info] Loading env
<module 'elfgames.go.game' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2019-02-16 22:29:30.394] [rlpytorch.model_loader.load_env0] [info] Parsed options: {'T': 1,
'actor_only': False,
'adam_eps': 0.001,
'additional_labels': ['aug_code', 'move_idx'],
'batchsize': 16,
'batchsize2': -1,
'black_use_policy_network_only': False,
'bn': True,
'bn_eps': 1e-05,
'bn_momentum': 0.1,
'cheat_eval_new_model_wins_half': False,
'cheat_selfplay_random_result': False,
'check_loaded_options': False,
'client_max_delay_sec': 1200,
'comment': '',
'data_aug': -1,
'dim': 256,
'dist_rank': -1,
'dist_url': '',
'dist_world_size': -1,
'dump_record_prefix': '',
'epsilon': 0.0,
'eval_model_pair': '',
'eval_num_games': 400,
'eval_old_model': -1,
'eval_stats': '',
'eval_winrate_thres': 0.55,
'expected_num_clients': -1,
'following_pass': False,
'gpu': 0,
'greedy': True,
'keep_prev_selfplay': False,
'keys_in_reply': ['V', 'rv'],
'leaky_relu': False,
'list_files': [],
'load': '/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'load_model_sleep_interval': 0.0,
'loglevel': 'debug',
'lr': 0.001,
'mcts_alpha': 0.0,
'mcts_epsilon': 0.0,
'mcts_persistent_tree': True,
'mcts_pick_method': 'most_visited',
'mcts_puct': 1.5,
'mcts_rollout_per_batch': 16,
'mcts_rollout_per_thread': 8192,
'mcts_root_unexplored_q_zero': False,
'mcts_threads': 2,
'mcts_unexplored_q_zero': False,
'mcts_use_prior': True,
'mcts_verbose': False,
'mcts_verbose_time': True,
'mcts_virtual_loss': 1,
'mode': 'online',
'model': 'online',
'momentum': 0.9,
'move_cutoff': -1,
'multipred_backprop': True,
'num_block': 20,
'num_future_actions': 1,
'num_games': 1,
'num_games_per_thread': -1,
'num_minibatch': 5000,
'num_reader': 50,
'num_reset_ranking': 5000,
'omit_keys': [],
'onload': [],
'opt_method': 'adam',
'parameter_print': False,
'parsed_args': ['df_console.py',
'--mode',
'online',
'--keys_in_reply',
'V',
'rv',
'--use_mcts',
'--mcts_verbose_time',
'--mcts_use_prior',
'--mcts_persistent_tree',
'--load',
'/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'--server_addr',
'localhost',
'--port',
'1234',
'--replace_prefix',
'resnet.module,resnet',
'--no_check_loaded_options',
'--no_parameter_print',
'--verbose',
'--gpu',
'0',
'--num_block',
'20',
'--dim',
'256',
'--mcts_puct',
'1.50',
'--batchsize',
'16',
'--mcts_rollout_per_batch',
'16',
'--mcts_threads',
'2',
'--mcts_rollout_per_thread',
'8192',
'--resign_thres',
'0.05',
'--mcts_virtual_loss',
'1',
'--loglevel',
'debug'],
'ply_pass_enabled': 0,
'policy_distri_cutoff': 0,
'policy_distri_training_for_all': False,
'port': 1234,
'preload_sgf': '',
'preload_sgf_move_to': -1,
'print_result': False,
'q_max_size': 1000,
'q_min_size': 10,
'ratio_pre_moves': 0,
'replace_prefix': ['resnet.module,resnet'],
'resign_thres': 0.05,
'sample_nodes': ['pi,a'],
'sample_policy': 'epsilon-greedy',
'selfplay_async': False,
'selfplay_init_num': 2000,
'selfplay_timeout_usec': 0,
'selfplay_update_num': 1000,
'server_addr': 'localhost',
'server_id': '',
'start_ratio_pre_moves': 0.5,
'store_greedy': False,
'suicide_after_n_games': -1,
'use_data_parallel': False,
'use_data_parallel_distributed': False,
'use_df_feature': False,
'use_fp16': False,
'use_mcts': True,
'use_mcts_ai2': False,
'verbose': True,
'weight_decay': 0.0,
'white_mcts_rollout_per_batch': -1,
'white_mcts_rollout_per_thread': -1,
'white_puct': -1.0,
'white_use_policy_network_only': False}
[2019-02-16 22:29:30.396] [rlpytorch.model_loader.load_env0] [info] Finished loading env
[2019-02-16 22:29:30.397] [elf::base::ThreadedDispatcherT-11] [info] Wait all games[1] to register their mailbox
human_actor: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'a', 'V'], 'batchsize': 1}
SharedMem: "human_actor", keys: ['a', 'V', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
actor_black: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'V', 'a', 'rv'], 'timeout_usec': 10, 'batchsize': 16}
SharedMem: "actor_black", keys: ['a', 'V', 'rv', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] Loading model from /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
Traceback (most recent call last):
File "df_console.py", line 87, in
https://github.com/pytorch/ELF/issues/133#issuecomment-463867827
I still have two errors that are not solved by using replace_prefix.
Did you try server.sh and client.sh?
No :( I will try. Thanks much! @l1t1
This is probably because of the version of PyTorch. A fix is on the way.
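To make the mismatch in the error message concrete: the checkpoint keys carry an extra `.module` segment (`init_conv.module.0.weight`) that the model does not expect (`init_conv.0.weight`), and the log shows only `resnet.module` being rewritten to `resnet`. Below is a minimal sketch of that kind of prefix rewrite on a plain dictionary. It is not ELF's implementation: the helper name `remap_state_dict_keys` is hypothetical, and the values are placeholders standing in for tensors.

```python
from collections import OrderedDict


def remap_state_dict_keys(state_dict, replace_prefix):
    """Return a copy of state_dict with key prefixes rewritten.

    replace_prefix is a list of (old, new) pairs; the first pair whose
    `old` matches the start of a key (on a dot boundary) is applied.
    Hypothetical helper for illustration only.
    """
    remapped = OrderedDict()
    for key, value in state_dict.items():
        for old, new in replace_prefix:
            if key == old or key.startswith(old + "."):
                key = new + key[len(old):]
                break
        remapped[key] = value
    return remapped


# Keys copied from the "Unexpected key(s)" list in the error message.
# The values are placeholders; a real checkpoint would hold tensors.
checkpoint = OrderedDict(
    (name, None)
    for name in [
        "init_conv.module.0.weight",
        "init_conv.module.0.bias",
        "init_conv.module.1.weight",
        "init_conv.module.1.bias",
        "init_conv.module.1.running_mean",
        "init_conv.module.1.running_var",
        "init_conv.module.1.num_batches_tracked",
    ]
)

fixed = remap_state_dict_keys(
    checkpoint,
    replace_prefix=[
        ("resnet.module", "resnet"),        # the pair already shown in the log
        ("init_conv.module", "init_conv"),  # assumed extra pair for the failing keys
    ],
)

for name in fixed:
    print(name)
```

After the rewrite, every name from the "Missing key(s)" list is present in `fixed`. Whether the remaining `num_batches_tracked` entry is accepted depends on the PyTorch version that loads the checkpoint.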
@hejin @l1t1 Which version of PyTorch did you use? We use PyTorch 1.0.
I use 1.0.1 with elf_convert.py too, but the Windows binary df_console.exe shouldn't require the user to install PyTorch.
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.0.1
Suggestion: df_console.exe should also support loading elfv2.bin and training data such as 1500000.bin, etc.
Could you please try the newly-revised gtp.sh in master?
I downloaded today's version:
D:\elfv2>\tool\wget -c https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
--2019-02-21 07:45:54-- https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
Length: 1076887016 (1.0G) [application/zip]
Saving to: 'play_opengo_v2.zip'
play_opengo_v2.zip 100%[=================================================>] 1.00G
2019-02-21 08:30:52 (391 KB/s) - 'play_opengo_v2.zip' saved [1076887016/1076887016]
Then I ran the CPU version with the built-in Sabaki, with the engine set to D:\elfv2\play_opengo_v2\elf_cpu_full\elf\df_console.exe. It doesn't work at all:
○ newelfv2> name
connection failed
○ newelfv2> version
connection failed
○ newelfv2> protocol_version
connection failed
○ newelfv2> list_commands
connection failed
○ newelfv2> komi 6.5
connection failed
[5504] Failed to execute script df_console
Traceback (most recent call last):
File "df_console.py", line 92, in <module>
File "df_console.py", line 85, in main
File "elf\utils_elf.py", line 435, in run
File "elf\utils_elf.py", line 383, in _call
File "elf\utils_elf.py", line 253, in cpu2gpu
File "elf\utils_elf.py", line 253, in <dictcomp>
File "site-packages\torch\cuda\__init__.py", line 161, in _lazy_init
File "site-packages\torch\cuda\__init__.py", line 75, in _check_driver
AssertionError: Torch not compiled with CUDA enabled
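The traceback above ends in `cpu2gpu`, which suggests the script tries to move data to a GPU even though this PyTorch build has no CUDA support. A common way to guard against that is to check `torch.cuda.is_available()` before choosing a device. The sketch below shows the idea; the helper `select_device` is hypothetical and is not part of ELF.

```python
def select_device(requested_gpu, cuda_available):
    """Return a device string: the requested GPU if CUDA is usable, else "cpu".

    A negative requested_gpu is treated here as "CPU only" (an assumption
    made for this sketch).
    """
    if requested_gpu >= 0 and cuda_available:
        return "cuda:{}".format(requested_gpu)
    return "cpu"


try:
    import torch

    # False on a CPU-only PyTorch build or when no usable GPU is present.
    cuda_available = torch.cuda.is_available()
except ImportError:
    # Keep the sketch runnable even where PyTorch is not installed.
    cuda_available = False

device = select_device(0, cuda_available)
print(device)
```

On a CPU-only build this prints `cpu` instead of raising the assertion seen in the traceback.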
But the GPU version works:
D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console
list_commands
= boardsize
clear_board
exit
final_score
genmove
komi
list_commands
name
play
protocol_version
quit
showboard
version
play b d16
=
genmove w
= N1
The GPU version also supports loading weights with --load:
D:\>fc /b D:\elfv2\play_opengo_v2\elf_gpu_full\elf\model-v2.bin d:\elfv2.bin |more
Comparing files D:\ELFV2\PLAY_OPENGO_V2\ELF_GPU_FULL\ELF\model-v2.bin and D:\ELFV2.BIN
FC: no differences encountered
some tests
quit
[2019-02-21 09:52:26.508] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:52:26.692] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.259] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.369] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 09:52:27.682] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 09:52:27.684] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 09:52:27.687] [elf::base::Context-3] [info] Stop tmp pool...
D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/elfv2.bin
version
= 1.0
quit
[2019-02-21 09:55:16.300] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:55:16.301] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 0 y = 1 move: ab please try again
[2019-02-21 09:55:16.303] [elfgames::go::mcts::MCTSActor-21] [error] model version 1 and required version 1290000 are not consistent
D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/1500000.bin
genmove b
= D3
? Invalid input
? Invalid input
? Invalid input
? Invalid input
? Invalid input
? Invalid input
genmove w
= C16
quit
[2019-02-21 10:08:29.307] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 10:08:30.431] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 10:08:30.933] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 10:08:30.937] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 10:08:30.957] [elf::base::Context-3] [info] Stop tmp pool...
Testing the ELF v1 weights:
D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/pretrained-go-19x19-v1.bin --num_block 20 --dim 224
? Invalid input
? Invalid input
genmove b
= Q16
? Invalid input
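When reading sessions like the ones above, it helps to recall that in GTP a reply starting with `=` is a success (for example `= Q16`) and a reply starting with `?` is a failure (for example `? Invalid input`), optionally followed directly by a numeric command id. The sketch below classifies single reply lines in that way. The function name `parse_gtp_response` is hypothetical, and it handles only one-line replies.

```python
def parse_gtp_response(line):
    """Split one GTP reply line into (ok, command_id, payload).

    ok is True for "=" replies and False for "?" replies. command_id is the
    optional integer that directly follows the status character, or None.
    """
    line = line.strip()
    if not line or line[0] not in "=?":
        raise ValueError("not a GTP reply: {!r}".format(line))
    ok = line[0] == "="
    rest = line[1:]
    # Collect the optional id: digits immediately after the status character.
    end = 0
    while end < len(rest) and rest[end].isdigit():
        end += 1
    command_id = int(rest[:end]) if end else None
    return ok, command_id, rest[end:].strip()


# Reply lines taken from the sessions in this thread.
for reply in ["= N1", "? Invalid input", "= 1.0", "="]:
    print(parse_gtp_response(reply))
```

Counting the `?` replies in a session this way makes it easier to see which commands the engine rejected.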