Possible problem with GPU configuration on experimental branch
Hi,
When working on a sub-branch of Olivier Teytaud's branch called "newtasks" (which uses the ELF framework for arbitrary abstract games), we stumbled on a possible GPU configuration error at run time, after a successful compile.
Steps to reproduce:
• git checkout "tristan_breakthrough"
• source activate ELF
• make -j
• bash ./alltest.sh
Note that we forced the GPU number to be one by changing line 53 of src_py/rlpytorch/model_loader.py to "1" rather than the default "-1": this was necessary to avoid a GPU run-time error in df_model3.py.
But now we get the following error about copying two tensors of different sizes at line 191 of utils_elf.py:
register actor_black for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7efe27380b00>
register actor_white for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7efe27380ac8>
Root: "./myserver"
In game start
No previous model loaded, loading from ./myserver
self.gpu = 0
bk:
tensor([1, 1, 1])
v.squeeze :
tensor([ 40, 125, 100, 128, 15, 117, 183, 23, 79, 183, 57, 166, 79, 59,
51, 67, 157, 173, 109, 40, 60, 165, 174, 149, 183, 56, 14, 53,
151, 169, 109, 179, 104, 12, 23, 138, 117, 115, 53, 177, 23, 26,
68, 141, 173, 35, 155, 86, 59, 43, 59, 57, 58, 34, 99, 114,
137, 22, 71, 139, 48, 103, 52, 173, 84, 40, 72, 30, 147, 163,
102, 119, 161, 37, 44, 177, 85, 41, 174, 6, 43, 24, 160, 9,
125, 69, 183, 151, 3, 36, 86, 38, 89, 182, 33, 38, 174, 176,
147, 162, 2, 82, 66, 1, 110, 12, 32, 110, 56, 158, 31, 50,
85, 122, 75, 82, 65, 77, 17, 112, 69, 96, 104, 188, 68, 90,
142, 86, 156, 178, 144, 6, 150, 177, 12, 7, 116, 68, 42, 121,
132, 58, 37, 169, 59, 50, 128, 19, 113, 120, 181, 109, 191, 74,
146, 152, 68, 159, 127, 20, 40, 13, 134, 49, 66, 91, 170, 172,
17, 158, 113, 118, 137, 120, 83, 38, 29, 157, 175, 142, 181, 112,
80, 81, 126, 58, 62, 36, 63, 175, 45, 40], device='cuda:0')
> /home/snicolet/programmation/ELF/src_py/elf/utils_elf.py(191)copy_from()
-> for k, v in this_src.items():
(Pdb)
Traceback (most recent call last):
File "./selfplay.py", line 203, in <module>
main()
File "./selfplay.py", line 197, in main
GC.run()
File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 440, in run
self._call(smem, *args, **kwargs)
File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 408, in _call
keys_extra, keys_missing = sel_reply.copy_from(reply)
File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 191, in copy_from
for k, v in this_src.items():
File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 191, in copy_from
for k, v in this_src.items():
File "/home/snicolet/.conda/envs/ELF/lib/python3.6/bdb.py", line 51, in trace_dispatch
return self.dispatch_line(frame)
File "/home/snicolet/.conda/envs/ELF/lib/python3.6/bdb.py", line 70, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Would you have any idea what our error may be? Thanks in advance!
The assignment bk[:] = v.squeeze() is not dimension-consistent, so the try/except block falls into debug mode.
See https://github.com/pytorch/ELF/blob/master/src_py/elf/utils_elf.py#L211
Could you print out the size of bk and the size of v here?
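(For reference, a minimal standalone PyTorch sketch of that failure mode, using the sizes from the log above rather than the actual ELF buffers; the variable values are only illustrative:)

import torch

# Standalone illustration of the failing copy in copy_from(), not the ELF code itself.
bk = torch.ones(3, dtype=torch.int64)   # destination buffer, size taken from the log above
v = torch.randint(0, 192, (192, 1))     # reply tensor with Breakthrough-style action indices

try:
    bk[:] = v.squeeze()                 # the dimension-inconsistent assignment
except RuntimeError as err:
    # PyTorch refuses to copy 192 elements into a buffer of size 3;
    # in utils_elf.py the except branch is what drops into the pdb session shown above.
    print(err)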
Hi Yuandong,
bk has size 2 and is equal to: tensor([1, 1])
v.squeeze has size 128 and is equal to:
tensor([172, 75, 90, 177, 6, 147, 189, 71, 181, 165, 85, 69, 141, 27, 59, 25, 87, 104, 153, 161, 108, 129, 136, 174, 173, 54, 85, 177, 82, 138, 170, 3, 91, 187, 68, 30, 166, 15, 45, 47, 41, 48, 160, 89, 122, 106, 178, 190, 63, 103, 29, 174, 164, 48, 39, 12, 168, 35, 44, 115, 64, 12, 108, 138, 13, 98, 173, 6, 188, 57, 98, 180, 94, 163, 25, 49, 2, 135, 73, 88, 143, 111, 61, 172, 42, 164, 160, 138, 91, 0, 127, 94, 78, 64, 179, 2, 86, 92, 137, 47, 170, 161, 82, 188, 44, 56, 6, 16, 113, 185, 82, 51, 57, 189, 41, 40, 126, 10, 30, 175, 42, 15, 9, 173, 149, 147, 110, 180], device='cuda:0')
In Breakthrough, action values are between 0 and 192.
Thanks.
Different runs give different results, but the size of v.squeeze is always 64 times the size of bk, and bk is always filled with ones.
When you call e.addField<int64_t>("a") somewhere in the code, make sure .addExtents has the correct size. E.g., in your case it should be e.addField<int64_t>("a").addExtents(batchsize, {batchsize}) where batchsize = 128. If you called it with batchsize = 2 but sent a vector of dim=128, you will see this error.
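(For completeness, a standalone Python sketch of that size relationship; this is not the ELF registration code itself, which is the C++ addField/addExtents call quoted above, and the variable names here are only illustrative:)

import torch

# The destination buffer is sized from the extents declared at registration time,
# while the reply sent back from the model has the real batch size.
declared_batchsize = 2     # what addExtents was effectively called with
actual_batchsize = 128     # what the model actually sends back

bk = torch.ones(declared_batchsize, dtype=torch.int64)
reply = torch.randint(0, 192, (actual_batchsize,))

print(reply.numel() // bk.numel())   # 64, the ratio observed across the runs above

# With matching extents (batchsize = 128 at registration) the copy succeeds:
bk_fixed = torch.ones(actual_batchsize, dtype=torch.int64)
bk_fixed[:] = reply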