lua---nnx

SoftMaxTree...

Open elbamos opened this issue 9 years ago • 12 comments

I'm seeing:

/usr/local/share/lua/5.1/nnx/SoftMaxTree.lua:171: attempt to call field 'SoftMaxTree_updateOutput' (a nil value)
stack traceback:
    /usr/local/share/lua/5.1/nnx/SoftMaxTree.lua:171: in function 'func'
    /usr/local/share/lua/5.1/nngraph/gmodule.lua:252: in function 'neteval'
    /usr/local/share/lua/5.1/nngraph/gmodule.lua:287: in function 'forward'
    runtrain.lua:194: in function 'opfunc'

I have a suspicion it is related to cunnx declining to build:

/tmp/luarocks_cunnx-scm-1-1265/cunnx/SoftMaxTree.cu(439): error: argument of type "THCudaTensor *" is incompatible with parameter of type "THCudaIntTensor *"

/tmp/luarocks_cunnx-scm-1-1265/cunnx/BlockSparse.cu(99): error: argument of type "THCudaTensor *" is incompatible with parameter of type "THCudaLongTensor *"

/tmp/luarocks_cunnx-scm-1-1265/cunnx/BlockSparse.cu(100): error: argument of type "THCudaTensor *" is incompatible with parameter of type "THCudaLongTensor *"

/tmp/luarocks_cunnx-scm-1-1265/cunnx/WindowGate.cu(110): error: argument of type "THCudaTensor *" is incompatible with parameter of type "THCudaLongTensor *"

/tmp/luarocks_cunnx-scm-1-1265/cunnx/WindowGate2.cu(120): error: argument of type "THCudaTensor *" is incompatible with parameter of type "THCudaLongTensor *"

5 errors detected in the compilation of "/tmp/tmpxft_0000ff8c_00000000-7_init.cpp1.ii".
CMake Error at cunnx_generated_init.cu.o.cmake:262 (message):
  Error generating file
  /tmp/luarocks_cunnx-scm-1-1265/cunnx/build/CMakeFiles/cunnx.dir//./cunnx_generated_init.cu.o


CMakeFiles/cunnx.dir/build.make:63: recipe for target 'CMakeFiles/cunnx.dir/cunnx_generated_init.cu.o' failed
make[2]: *** [CMakeFiles/cunnx.dir/cunnx_generated_init.cu.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/cunnx.dir/all' failed
make[1]: *** [CMakeFiles/cunnx.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

Error: Build error: Failed building.

Could this be related to Mac OS X? I noticed that the nnx build script wants to generate .so's but not dylibs.

elbamos avatar Jan 12 '16 08:01 elbamos

Fixed by https://github.com/nicholas-leonard/cunnx/pull/23

nicholas-leonard avatar Jan 12 '16 16:01 nicholas-leonard

Yes - fix confirmed, thanks!

elbamos avatar Jan 12 '16 16:01 elbamos

I spoke too soon. While it compiles now, it seems like there may be a memory leak. Depending on the batch size and whether I use L2 regularization, I get a CUDA out-of-memory error before the 2nd, 3rd, or 4th epoch.

elbamos avatar Jan 12 '16 20:01 elbamos

@elbamos Do you think it has anything to do with this commit: https://github.com/nicholas-leonard/cunnx/commit/9ebc12ba9e287efcfe08b877156780c090f5befd ? My CUDA is rusty.

nicholas-leonard avatar Jan 12 '16 23:01 nicholas-leonard

I don't know. When it comes to CUDA, I'm very much a user, not a programmer, I'm afraid. I know just enough to say that a memory leak is a suspect, for three reasons: the amount of free GPU RAM after epoch 2 is substantially less than after epoch 1; nothing in the code changed other than to support the move to SoftMaxTree; and dropping L2 regularization makes it last longer (because the parameters would have to be copied to be multiplied).

elbamos avatar Jan 12 '16 23:01 elbamos

@elbamos It should be fixed with the newest commits. Please reinstall nnx and cunnx.

nicholas-leonard avatar Jan 19 '16 16:01 nicholas-leonard

@nicholas-leonard

'fraid not :( Installing the latest version, I get exactly the same result as before.

elbamos avatar Jan 20 '16 08:01 elbamos

@elbamos What script are you running? I could try to reproduce on my end.

nicholas-leonard avatar Jan 20 '16 15:01 nicholas-leonard

@nicholas-leonard Do you need the net design or the whole training script & data? How should I get it to you?

elbamos avatar Jan 20 '16 18:01 elbamos

@elbamos Whole training script and data. You can share your repository with me or send it to me via email.

nicholas-leonard avatar Jan 20 '16 23:01 nicholas-leonard

Has there been any update on this, please? I installed the newer version of nnx and am still getting the same error as above.

abhisheksgumadi avatar Jul 05 '16 16:07 abhisheksgumadi

@abhisheksgumadi The unit tests and compilation pass on my end (using Ubuntu 14). Could you be more explicit about your issue?

nicholas-leonard avatar Jul 06 '16 15:07 nicholas-leonard