
failed to load the pretrained v2 model to run Go bot

Open hejin opened this issue 6 years ago • 12 comments

Hi guys,

I followed the project homepage instructions exactly (all software versions match) and tried to run the Go bot with the pretrained v2 model, but it failed with this message: "RuntimeError: Error(s) in loading state_dict for Model_PolicyValue: Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var". Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked"."

The box is a 24-core x86-64 machine with an NVIDIA V100 GPU (16 GB).

The full log is below. Thanks very much!

(base) roobot@ELF:~/play-ELF/ELF/scripts/elfgames/go$ ./run.sh /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
Python version: 3.7.1 (default, Dec 14 2018, 19:28:38) [GCC 7.3.0]
PyTorch version: 1.0.1.post2
CUDA version 10.0.130
Conda env: base
[2019-02-16 22:29:30.383] [rlpytorch.model_loader.load_env0] [info] Loading env
<module 'elfgames.go.game' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2019-02-16 22:29:30.394] [rlpytorch.model_loader.load_env0] [info] Parsed options: {'T': 1, 'actor_only': False, 'adam_eps': 0.001, 'additional_labels': ['aug_code', 'move_idx'], 'batchsize': 16, 'batchsize2': -1, 'black_use_policy_network_only': False, 'bn': True, 'bn_eps': 1e-05, 'bn_momentum': 0.1, 'cheat_eval_new_model_wins_half': False, 'cheat_selfplay_random_result': False, 'check_loaded_options': False, 'client_max_delay_sec': 1200, 'comment': '', 'data_aug': -1, 'dim': 256, 'dist_rank': -1, 'dist_url': '', 'dist_world_size': -1, 'dump_record_prefix': '', 'epsilon': 0.0, 'eval_model_pair': '', 'eval_num_games': 400, 'eval_old_model': -1, 'eval_stats': '', 'eval_winrate_thres': 0.55, 'expected_num_clients': -1, 'following_pass': False, 'gpu': 0, 'greedy': True, 'keep_prev_selfplay': False, 'keys_in_reply': ['V', 'rv'], 'leaky_relu': False, 'list_files': [], 'load': '/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin', 'load_model_sleep_interval': 0.0, 'loglevel': 'debug', 'lr': 0.001, 'mcts_alpha': 0.0, 'mcts_epsilon': 0.0, 'mcts_persistent_tree': True, 'mcts_pick_method': 'most_visited', 'mcts_puct': 1.5, 'mcts_rollout_per_batch': 16, 'mcts_rollout_per_thread': 8192, 'mcts_root_unexplored_q_zero': False, 'mcts_threads': 2, 'mcts_unexplored_q_zero': False, 'mcts_use_prior': True, 'mcts_verbose': False, 'mcts_verbose_time': True, 'mcts_virtual_loss': 1, 'mode': 'online', 'model': 'online', 'momentum': 0.9, 'move_cutoff': -1, 'multipred_backprop': True, 'num_block': 20, 'num_future_actions': 1, 'num_games': 1, 'num_games_per_thread': -1, 'num_minibatch': 5000, 'num_reader': 50, 'num_reset_ranking': 5000, 'omit_keys': [], 'onload': [], 'opt_method': 'adam', 'parameter_print': False, 'parsed_args': ['df_console.py', '--mode', 'online', '--keys_in_reply', 'V', 'rv', '--use_mcts', '--mcts_verbose_time', '--mcts_use_prior', '--mcts_persistent_tree', '--load', '/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin', '--server_addr', 'localhost', '--port', '1234', '--replace_prefix', 'resnet.module,resnet', '--no_check_loaded_options', '--no_parameter_print', '--verbose', '--gpu', '0', '--num_block', '20', '--dim', '256', '--mcts_puct', '1.50', '--batchsize', '16', '--mcts_rollout_per_batch', '16', '--mcts_threads', '2', '--mcts_rollout_per_thread', '8192', '--resign_thres', '0.05', '--mcts_virtual_loss', '1', '--loglevel', 'debug'], 'ply_pass_enabled': 0, 'policy_distri_cutoff': 0, 'policy_distri_training_for_all': False, 'port': 1234, 'preload_sgf': '', 'preload_sgf_move_to': -1, 'print_result': False, 'q_max_size': 1000, 'q_min_size': 10, 'ratio_pre_moves': 0, 'replace_prefix': ['resnet.module,resnet'], 'resign_thres': 0.05, 'sample_nodes': ['pi,a'], 'sample_policy': 'epsilon-greedy', 'selfplay_async': False, 'selfplay_init_num': 2000, 'selfplay_timeout_usec': 0, 'selfplay_update_num': 1000, 'server_addr': 'localhost', 'server_id': '', 'start_ratio_pre_moves': 0.5, 'store_greedy': False, 'suicide_after_n_games': -1, 'use_data_parallel': False, 'use_data_parallel_distributed': False, 'use_df_feature': False, 'use_fp16': False, 'use_mcts': True, 'use_mcts_ai2': False, 'verbose': True, 'weight_decay': 0.0, 'white_mcts_rollout_per_batch': -1, 'white_mcts_rollout_per_thread': -1, 'white_puct': -1.0, 'white_use_policy_network_only': False}
[2019-02-16 22:29:30.396] [rlpytorch.model_loader.load_env0] [info] Finished loading env
[2019-02-16 22:29:30.397] [elf::base::ThreadedDispatcherT-11] [info] Wait all games[1] to register their mailbox
human_actor: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'a', 'V'], 'batchsize': 1}
SharedMem: "human_actor", keys: ['a', 'V', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
actor_black: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'V', 'a', 'rv'], 'timeout_usec': 10, 'batchsize': 16}
SharedMem: "actor_black", keys: ['a', 'V', 'rv', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] Loading model from /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
Traceback (most recent call last):
  File "df_console.py", line 87, in <module>
    main()
  File "df_console.py", line 47, in main
    model = model_loader.load_model(GC.params)
  File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_loader.py", line 161, in load_model
    check_loaded_options=self.options.check_loaded_options)
  File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_base.py", line 147, in load
    self.load_state_dict(sd)
  File "/home/roobot/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
    Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
    Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".

hejin avatar Feb 16 '19 15:02 hejin

https://github.com/pytorch/ELF/issues/133#issuecomment-463867827
I still have two errors that using --replace_prefix does not solve.
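
For readers following along: in the error from the original report, the missing and unexpected keys differ only by a ".module" segment after "init_conv", while the logged replace_prefix setting only covers "resnet.module". The sketch below illustrates what a prefix replacement does to such keys. It is a generic illustration on a plain dict, not ELF's loader code; the function name replace_prefixes and the extra ("init_conv.module", "init_conv") pair are assumptions made for this example, and whether passing that pair to --replace_prefix resolves the issue is not confirmed in this thread.

```python
def replace_prefixes(state_dict, replacements):
    """Return a copy of state_dict with matching key prefixes rewritten.

    replacements is a list of (old_prefix, new_prefix) pairs; the first
    pair that matches a key wins. Hypothetical helper, not part of ELF.
    """
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in replacements:
            # Match whole dotted segments only, so "resnet.module" does
            # not accidentally match a key like "resnet.modules_x".
            if key == old or key.startswith(old + "."):
                new_key = new + key[len(old):]
                break
        renamed[new_key] = value
    return renamed


# Keys copied from the "Unexpected key(s)" list in the report.
# The values are placeholders standing in for tensors.
unexpected = [
    "init_conv.module.0.weight",
    "init_conv.module.0.bias",
    "init_conv.module.1.weight",
    "init_conv.module.1.bias",
    "init_conv.module.1.running_mean",
    "init_conv.module.1.running_var",
    "init_conv.module.1.num_batches_tracked",
]
checkpoint = {key: None for key in unexpected}

fixed = replace_prefixes(
    checkpoint,
    [("resnet.module", "resnet"), ("init_conv.module", "init_conv")],
)
for key in sorted(fixed):
    print(key)
```

After the rewrite, every key in the "Missing key(s)" list from the report is present in the renamed dict.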

l1t1 avatar Feb 16 '19 23:02 l1t1

Did you try server.sh and client.sh?

l1t1 avatar Feb 17 '19 00:02 l1t1

No :( I will try. Thanks much! @l1t1

hejin avatar Feb 17 '19 02:02 hejin

This is probably because of the version of PyTorch. A fix is on the way.

yuandong-tian avatar Feb 18 '19 06:02 yuandong-tian

@hejin @l1t1 What version of PyTorch did you use? We use PyTorch 1.0.

yuandong-tian avatar Feb 18 '19 06:02 yuandong-tian

I use 1.0.1 with elf_convert.py too, but the Windows binary df_console.exe shouldn't require the user to install PyTorch.

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.0.1

l1t1 avatar Feb 18 '19 07:02 l1t1

I suggest that df_console.exe also support loading elfv2.bin and training data such as 1500000.bin.

l1t1 avatar Feb 18 '19 07:02 l1t1

Could you please try the newly-revised gtp.sh in master?

jma127 avatar Feb 20 '19 22:02 jma127

I downloaded today's package:

D:\elfv2>\tool\wget -c https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
--2019-02-21 07:45:54--  https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
Length: 1076887016 (1.0G) [application/zip]
Saving to: 'play_opengo_v2.zip'

play_opengo_v2.zip            100%[=================================================>]   1.00G

2019-02-21 08:30:52 (391 KB/s) - 'play_opengo_v2.zip' saved [1076887016/1076887016]

Then I ran the CPU version with the built-in Sabaki, setting the engine to D:\elfv2\play_opengo_v2\elf_cpu_full\elf\df_console.exe. It doesn't work at all:

○ newelfv2> name 
connection failed
○ newelfv2> version 
connection failed
○ newelfv2> protocol_version 
connection failed
○ newelfv2> list_commands 
connection failed
○ newelfv2> komi 6.5
connection failed
[5504] Failed to execute script df_console
Traceback (most recent call last):
  File "df_console.py", line 92, in <module>
  File "df_console.py", line 85, in main
  File "elf\utils_elf.py", line 435, in run
  File "elf\utils_elf.py", line 383, in _call
  File "elf\utils_elf.py", line 253, in cpu2gpu
  File "elf\utils_elf.py", line 253, in <dictcomp>
  File "site-packages\torch\cuda\__init__.py", line 161, in _lazy_init
  File "site-packages\torch\cuda\__init__.py", line 75, in _check_driver
AssertionError: Torch not compiled with CUDA enabled
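
The AssertionError above is what PyTorch raises when CUDA is requested from a CPU-only build. A generic guard, shown here only as a sketch and not as the df_console code, is to check for CUDA at run time and fall back to the CPU. The variable names are made up for this example; torch.cuda.is_available() is the standard PyTorch check.

```python
# Sketch: choose a device without assuming the PyTorch build has CUDA.
try:
    import torch
    cuda_ok = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed at all; treat this as "no CUDA".
    cuda_ok = False

# Tensors would then be moved with tensor.to(device) rather than an
# unconditional tensor.cuda() call.
device = "cuda" if cuda_ok else "cpu"
print("selected device:", device)
```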

l1t1 avatar Feb 21 '19 00:02 l1t1

But the GPU version works:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console

list_commands
= boardsize
clear_board
exit
final_score
genmove
komi
list_commands
name
play
protocol_version
quit
showboard
version

play b d16
=

genmove w
= N1

l1t1 avatar Feb 21 '19 01:02 l1t1

The GPU version also supports loading weights with --load:

D:\>fc /b D:\elfv2\play_opengo_v2\elf_gpu_full\elf\model-v2.bin d:\elfv2.bin |more
Comparing files D:\ELFV2\PLAY_OPENGO_V2\ELF_GPU_FULL\ELF\model-v2.bin and D:\ELFV2.BIN
FC: no differences encountered
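
The fc /b check above compares the two weight files byte for byte. A rough cross-platform equivalent in Python, as a sketch: the helper name same_bytes and the temporary file names are placeholders for this example, not the real model files.

```python
import filecmp
import os
import tempfile


def same_bytes(path_a, path_b):
    """Return True if both files have identical contents."""
    # shallow=False compares file contents instead of trusting
    # os.stat() metadata (size and modification time) alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo on throwaway files standing in for two weight files.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.bin")
    b = os.path.join(tmp, "b.bin")
    c = os.path.join(tmp, "c.bin")
    payloads = ((a, b"\x00\x01weights"), (b, b"\x00\x01weights"), (c, b"\x00\x02weights"))
    for path, payload in payloads:
        with open(path, "wb") as fh:
            fh.write(payload)
    identical = same_bytes(a, b)
    different = same_bytes(a, c)

print(identical, different)
```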

Some tests:

quit
[2019-02-21 09:52:26.508] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:52:26.692] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.259] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.369] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 09:52:27.682] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 09:52:27.684] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 09:52:27.687] [elf::base::Context-3] [info] Stop tmp pool...

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/elfv2.bin
version
= 1.0

quit
[2019-02-21 09:55:16.300] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:55:16.301] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 0 y = 1 move: ab please try again
[2019-02-21 09:55:16.303] [elfgames::go::mcts::MCTSActor-21] [error] model version 1 and required version 1290000 are not consistent

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/1500000.bin
genmove b
= D3


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input

genmove w
= C16

quit
[2019-02-21 10:08:29.307] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 10:08:30.431] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 10:08:30.933] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 10:08:30.937] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 10:08:30.957] [elf::base::Context-3] [info] Stop tmp pool...

l1t1 avatar Feb 21 '19 02:02 l1t1

Testing the ELF v1 weights:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load  d:/pretrained-go-19x19-v1.bin --num_block 20 --dim 224

? Invalid input


? Invalid input

genmove b
= Q16


? Invalid input

l1t1 avatar Feb 21 '19 02:02 l1t1