
Possible problem with GPU configuration on experimental branch

Open · snicolet opened this issue 6 years ago · 4 comments

Hi,

When working on a sub-branch of Olivier Teytaud's branch called "newtasks" (which uses the ELF framework for arbitrary abstract games), we stumbled on a possible GPU configuration error at run time, after a successful compile.

Steps to reproduce:

• git checkout "tristan_breakthrough"
• source activate ELF
• make -j
• bash ./alltest.sh

Note that we forced the GPU number to be 1 by changing line 53 of src_py/rlpytorch/model_loader.py to use "1" rather than the default "-1": this was necessary to avoid a GPU run-time error in df_model3.py.

But now we get the following error about copying two tensors of different sizes at line 191 of utils_elf.py:

register actor_black for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7efe27380b00>
register actor_white for e = <rlpytorch.trainer.trainer.Evaluator object at 0x7efe27380ac8>
Root: "./myserver"
In game start
No previous model loaded, loading from ./myserver
self.gpu =  0
bk:
tensor([1, 1, 1])
v.squeeze :
tensor([ 40, 125, 100, 128,  15, 117, 183,  23,  79, 183,  57, 166,  79,  59,
         51,  67, 157, 173, 109,  40,  60, 165, 174, 149, 183,  56,  14,  53,
        151, 169, 109, 179, 104,  12,  23, 138, 117, 115,  53, 177,  23,  26,
         68, 141, 173,  35, 155,  86,  59,  43,  59,  57,  58,  34,  99, 114,
        137,  22,  71, 139,  48, 103,  52, 173,  84,  40,  72,  30, 147, 163,
        102, 119, 161,  37,  44, 177,  85,  41, 174,   6,  43,  24, 160,   9,
        125,  69, 183, 151,   3,  36,  86,  38,  89, 182,  33,  38, 174, 176,
        147, 162,   2,  82,  66,   1, 110,  12,  32, 110,  56, 158,  31,  50,
         85, 122,  75,  82,  65,  77,  17, 112,  69,  96, 104, 188,  68,  90,
        142,  86, 156, 178, 144,   6, 150, 177,  12,   7, 116,  68,  42, 121,
        132,  58,  37, 169,  59,  50, 128,  19, 113, 120, 181, 109, 191,  74,
        146, 152,  68, 159, 127,  20,  40,  13, 134,  49,  66,  91, 170, 172,
         17, 158, 113, 118, 137, 120,  83,  38,  29, 157, 175, 142, 181, 112,
         80,  81, 126,  58,  62,  36,  63, 175,  45,  40], device='cuda:0')
> /home/snicolet/programmation/ELF/src_py/elf/utils_elf.py(191)copy_from()
-> for k, v in this_src.items():
(Pdb) 
Traceback (most recent call last):
  File "./selfplay.py", line 203, in <module>
    main()
  File "./selfplay.py", line 197, in main
    GC.run()
  File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 440, in run
    self._call(smem, *args, **kwargs)
  File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 408, in _call
    keys_extra, keys_missing = sel_reply.copy_from(reply)
  File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 191, in copy_from
    for k, v in this_src.items():
  File "/home/snicolet/programmation/ELF/src_py/elf/utils_elf.py", line 191, in copy_from
    for k, v in this_src.items():
  File "/home/snicolet/.conda/envs/ELF/lib/python3.6/bdb.py", line 51, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/snicolet/.conda/envs/ELF/lib/python3.6/bdb.py", line 70, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

Would you have any idea what our error may be? Thanks in advance!

snicolet · Dec 17 '18 13:12

The assignment bk[:] = v.squeeze() is not dimension-consistent, so the try/except block falls into debug mode. See https://github.com/pytorch/ELF/blob/master/src_py/elf/utils_elf.py#L211
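
Outside ELF, the failure can be reproduced with a small PyTorch sketch (the names bk and v and the sizes 3 and 192 are only illustrative, mirroring your first log above; this is not ELF code):

    import torch

    bk = torch.ones(3, dtype=torch.int64)                    # destination slot, as in "bk: tensor([1, 1, 1])"
    v = torch.randint(0, 192, (1, 192), dtype=torch.int64)   # a reply tensor with far more entries

    try:
        bk[:] = v.squeeze()   # the same assignment as in copy_from()
    except RuntimeError as err:
        # e.g. "The expanded size of the tensor (3) must match the existing size (192) ..."
        print(err)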

Could you print out the size of bk and size of v here?
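
For example, straight from the (Pdb) prompt in your log (this assumes k, bk and v are still in scope at that point, which your printout suggests):

    (Pdb) p k
    (Pdb) p bk.size()
    (Pdb) p v.squeeze().size()

p k will also tell you which reply field is being copied when it fails.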

yuandong-tian · Dec 17 '18 17:12

Hi Yuandong,

bk has size 2 and is equal to:

tensor([1, 1])

v.squeeze has size 128 and is equal to:

tensor([172, 75, 90, 177, 6, 147, 189, 71, 181, 165, 85, 69, 141, 27, 59, 25,
        87, 104, 153, 161, 108, 129, 136, 174, 173, 54, 85, 177, 82, 138, 170, 3,
        91, 187, 68, 30, 166, 15, 45, 47, 41, 48, 160, 89, 122, 106, 178, 190,
        63, 103, 29, 174, 164, 48, 39, 12, 168, 35, 44, 115, 64, 12, 108, 138,
        13, 98, 173, 6, 188, 57, 98, 180, 94, 163, 25, 49, 2, 135, 73, 88,
        143, 111, 61, 172, 42, 164, 160, 138, 91, 0, 127, 94, 78, 64, 179, 2,
        86, 92, 137, 47, 170, 161, 82, 188, 44, 56, 6, 16, 113, 185, 82, 51,
        57, 189, 41, 40, 126, 10, 30, 175, 42, 15, 9, 173, 149, 147, 110, 180], device='cuda:0')

In Breakthrough, action values are between 0 and 192.

Thanks.

tristancazenave · Dec 17 '18 20:12

Different runs give different results, but the size of v.squeeze is always 64 times the size of bk, and bk is always filled with ones.

tristancazenave · Dec 17 '18 22:12

When you call e.addField<int64_t>("a") somewhere in the code, make sure .addExtents has the correct size. E.g., in your case it should be

e.addField<int64_t>("a").addExtents(batchsize, {batchsize})

where batchsize = 128. If you registered it with batchsize = 2 but sent a vector of dim = 128, you will see this error.
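
If you want to confirm on the Python side which of the two is wrong before touching the registration, a quick shape check right before the failing copy will tell you (a debugging aid only, not part of ELF; bk, v and k follow the pdb session above):

    # hypothetical guard, e.g. dropped into copy_from() in utils_elf.py while debugging
    src = v.squeeze()
    if bk.numel() != src.numel():
        raise ValueError(
            "extent mismatch for key %r: dst shape %s vs src shape %s"
            % (k, tuple(bk.shape), tuple(src.shape)))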

yuandong-tian · Dec 17 '18 23:12