alpha_zero_torch_example terminates on leduc_poker
```bash
./build/examples/alpha_zero_torch_example \
  --game=kuhn_poker \
  --path=/home/privateclient/alpha_zero_stuff/example/
```
Works fine.
```bash
./build/examples/alpha_zero_torch_example \
  --game=leduc_poker \
  --path=/home/privateclient/alpha_zero_stuff/example/ \
  --inference_cache=0
```
Works fine (but is slow ofc).
```bash
./build/examples/alpha_zero_torch_example \
  --game=leduc_poker \
  --path=/home/privateclient/alpha_zero_stuff/example/
```
Outputs:
```
Playing game: leduc_poker
Loading model from step 0
[W925 02:41:43.429698047 TensorCompare.cpp:615] Warning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (function operator())
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'c10::Error'
```
Likely some kind of hash collision issue related to the inference cache that only shows up in games with a high number of states.
Hey! What is your operating system version? Do you use CUDA?
I'll try to reproduce and look into the problem.
Hmm... out of curiosity, does it work on simple perfect-information games, e.g. Tic-Tac-Toe?
If so, I suspect it's because the default config is built for 2D convolutional-style inputs, which Leduc doesn't have.
If not, it's likely a deeper bug. That code is getting old now; we don't maintain it, and I don't remember the last time someone reported running it. (E.g. it's possible that there's an issue with newer versions of LibTorch.)
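For reference, one quick way to check is to print each game's observation tensor shape. The snippet below is just my own sketch against the public OpenSpiel API (not part of the example); I'd expect Tic-Tac-Toe to report a rank-3, image-like shape and the poker games to report flat vectors.

```cpp
// Sketch: print the observation tensor shape per game to see whether the
// inputs are image-like (rank 2/3) or a flat vector.
#include <iostream>
#include <memory>

#include "open_spiel/spiel.h"

int main() {
  for (const char* name : {"tic_tac_toe", "kuhn_poker", "leduc_poker"}) {
    std::shared_ptr<const open_spiel::Game> game = open_spiel::LoadGame(name);
    std::cout << name << ": [ ";
    for (int d : game->ObservationTensorShape()) std::cout << d << " ";
    std::cout << "]" << std::endl;
  }
  return 0;
}
```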
I tested Tic-Tac-Toe and it works fine, so it is likely due to the inputs of Leduc rather than any deeper bug.
Libtorch is:
OS is Ubuntu 20.04
Thank you for the clarification, will check and report shortly.
Was able to reproduce the issue on CPU with `./build/examples/alpha_zero_torch_example --game=leduc_poker --path=$HOME/lp`:
```
terminate called after throwing an instance of 'c10::Error'
terminate called recursively
terminate called recursively
terminate called recursively
what(): Expected more than 1 value per channel when training, got input size [1, 128, 1, 1]
Exception raised from batch_norm at /home/alexa/open_spiel/open_spiel/libtorch/libtorch/include/torch/csrc/api/include/torch/nn/functional/batchnorm.h:35 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7a9c77c0812c in /home/alexa/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7a9c77bb196c in /home/alexa/open_spiel/open_spiel/libtorch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x487a0e (0x5b21ef194a0e in ./build/examples/alpha_zero_torch_example)
frame #3: <unknown function> + 0x4873cb (0x5b21ef1943cb in ./build/examples/alpha_zero_torch_example)
frame #4: <unknown function> + 0x474505 (0x5b21ef181505 in ./build/examples/alpha_zero_torch_example)
frame #5: <unknown function> + 0x478d56 (0x5b21ef185d56 in ./build/examples/alpha_zero_torch_example)
frame #6: <unknown function> + 0x478548 (0x5b21ef185548 in ./build/examples/alpha_zero_torch_example)
frame #7: <unknown function> + 0x4921b0 (0x5b21ef19f1b0 in ./build/examples/alpha_zero_torch_example)
frame #8: <unknown function> + 0x48faa2 (0x5b21ef19caa2 in ./build/examples/alpha_zero_torch_example)
frame #9: <unknown function> + 0x489cb1 (0x5b21ef196cb1 in ./build/examples/alpha_zero_torch_example)
frame #10: <unknown function> + 0x489642 (0x5b21ef196642 in ./build/examples/alpha_zero_torch_example)
frame #11: <unknown function> + 0x3ffa0a (0x5b21ef10ca0a in ./build/examples/alpha_zero_torch_example)
frame #12: <unknown function> + 0x460f08 (0x5b21ef16df08 in ./build/examples/alpha_zero_torch_example)
frame #13: <unknown function> + 0x46215e (0x5b21ef16f15e in ./build/examples/alpha_zero_torch_example)
frame #14: <unknown function> + 0x469d89 (0x5b21ef176d89 in ./build/examples/alpha_zero_torch_example)
frame #15: <unknown function> + 0xecdb4 (0x7a9c5faecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #16: <unknown function> + 0x9caa4 (0x7a9c5f69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #17: <unknown function> + 0x129c6c (0x7a9c5f729c6c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called recursively
Aborted (core dumped)
```
Looks like the evaluation ~~different observation~~ structure definitely breaks the execution. Will report back a bit later.
@Viren6, the source of the problem is a simple bug: during training there can be batches of size 1, which the BatchNorm2d implementation does not allow (see the minimal repro below).
This problem does not come from the library itself; it is caused by the fact that libtorch uses OMP multithreading, which may spawn additional subthreads that split the batch up.
cc @lanctot in case you know a quick fix.
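For reference, the failure is reproducible outside the example. The snippet below is a minimal standalone sketch (my own, not taken from the alpha_zero_torch code) showing that BatchNorm2d in training mode rejects a batch of size 1, while a batch of size 2 or eval mode is fine.

```cpp
// Minimal standalone repro (sketch): BatchNorm2d cannot compute batch
// statistics in training mode when there is only one value per channel.
#include <iostream>

#include <torch/torch.h>

int main() {
  torch::nn::BatchNorm2d bn(torch::nn::BatchNorm2dOptions(128));

  bn->train();  // training mode: normalizes with batch statistics
  try {
    bn->forward(torch::randn({1, 128, 1, 1}));  // batch of size 1 -> throws
  } catch (const c10::Error& e) {
    std::cout << "train(), batch=1: " << e.what_without_backtrace() << std::endl;
  }
  bn->forward(torch::randn({2, 128, 1, 1}));  // batch of size 2 works

  bn->eval();  // eval mode: uses running statistics, so batch=1 is fine
  bn->forward(torch::randn({1, 128, 1, 1}));
  std::cout << "eval(), batch=1: ok" << std::endl;
  return 0;
}
```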
Update: it's not necessarily because of the inter-op multithreading, but because the ModuleImpls aren't thread-safe for inference, and sometimes during training there are erroneous samples from the rollouts or evaluation. I know that the device updates should be locked, but something still doesn't look right.
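If it helps, here is a rough sketch of the kind of mitigation I have in mind (the class and method names are made up, it assumes a simple tensor-to-tensor module, and it is not a tested patch): serialize access to the shared module with a mutex, run inference in eval/no-grad mode, and skip degenerate batches of size 1 during training.

```cpp
// Sketch of a thread-safety guard around a shared libtorch module.
// GuardedNet, Inference and Learn are hypothetical names for illustration.
#include <mutex>
#include <utility>

#include <torch/torch.h>

class GuardedNet {
 public:
  explicit GuardedNet(torch::nn::AnyModule model) : model_(std::move(model)) {}

  // Inference: run under the lock and in eval/no-grad mode, so BatchNorm uses
  // running statistics and a batch of size 1 is harmless.
  torch::Tensor Inference(const torch::Tensor& input) {
    std::lock_guard<std::mutex> lock(mutex_);
    torch::NoGradGuard no_grad;
    model_.ptr()->eval();
    return model_.forward(input);
  }

  // Learning: also under the lock, and skip degenerate batches of size 1 that
  // BatchNorm2d rejects in training mode.
  bool Learn(const torch::Tensor& batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (batch.size(0) < 2) return false;  // too small to batch-normalize
    model_.ptr()->train();
    model_.forward(batch);  // ... loss/backward/optimizer step would go here
    return true;
  }

 private:
  std::mutex mutex_;
  torch::nn::AnyModule model_;
};
```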