
please fix coding errors in pickup_best_model.lua script

Open garzy opened this issue 4 years ago • 14 comments

I had to add this at the beginning of the script:

require 'nn'
require 'cunn'

And

local path = arguments.model_path  
  path = path .. "NoLimit/"

because game_settings.nl is nil.

And replace this

 local best_model_path = path .. '/epoch_' .. epoch .. net_type_str .. '.model'

with this

 local best_model_path = path .. '/epoch_' .. best_epoch .. net_type_str .. '.model'

because the original line also raises a nil error.
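
The two path fixes above could be combined roughly as follows. This is a minimal sketch, not the script's actual code: `join_path` and `build_best_model_path` are hypothetical helpers, `variant_dir` stands in for game_settings.nl, and the "NoLimit/" fallback simply mirrors the workaround described above.

```lua
-- Hypothetical helper: join a directory and a file name with exactly one slash.
local function join_path(dir, name)
  -- The parentheses keep only gsub's first return value (the string).
  return (dir:gsub("/+$", "")) .. "/" .. name
end

-- Hypothetical helper: build the path to the best epoch's model file.
-- `variant_dir` stands in for game_settings.nl, which was nil in this report.
local function build_best_model_path(model_path, variant_dir, best_epoch, net_type_str)
  assert(type(model_path) == "string", "model_path must be a string")
  assert(best_epoch ~= nil, "best_epoch must not be nil")
  -- Fall back to "NoLimit/" when the configured directory is nil.
  local dir = model_path .. (variant_dir or "NoLimit/")
  return join_path(dir, "epoch_" .. best_epoch .. net_type_str .. ".model")
end

print(build_best_model_path("../Data/Models/", nil, 201, "_gpu"))
```

Failing early on a nil best_epoch gives a clearer message than the string concatenation error that the original line produced.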

Finally, it crashes at this line:

  torch.save(final_model_file_name, best_model)

error thrown:

 /home/torch/install/share/lua/5.2/torch/File.lua:136: attempt to call field 'insert' (a nil value)
stack traceback:
        /home/torch/install/share/lua/5.2/torch/File.lua:136: in function 'writeObject'
        /home/torch/install/share/lua/5.2/torch/File.lua:388: in function 'save'
        Training/pickup_best_model.lua:76: in function 'select_best_model'
        Training/pickup_best_model.lua:90: in main chunk
        [C]: in function 'dofile'
        /home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk

I'm on Kubuntu 16.04 with Torch, Lua 5.2, cutorch, CUDA, and an Nvidia GTX 1060 with 6 GB of RAM.

garzy avatar Oct 07 '20 06:10 garzy

Hi @garzy

Thanks for reporting this issue.

I have fixed the errors. Could you remove your local changes and update your local repository?

aikupoker avatar Oct 07 '20 08:10 aikupoker

I've updated the file and launched it again, but it still crashes at this line:

  torch.save(final_model_file_name, best_model)

It throws the same exception as above.

Could it be a problem with the type returned by local best_model = torch.load(best_model_path)?
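
The message says that a field named `insert` is nil at the point where torch's File.lua tries to call it. If that call is the standard `table.insert` (an assumption; File.lua is not quoted in this thread), one thing worth ruling out is that the global `table` has been overwritten before `torch.save` runs. A minimal diagnostic sketch, with `check_table_library` as a hypothetical helper:

```lua
-- Hypothetical diagnostic: report whether a value looks like the intact
-- standard `table` library (a table whose `insert` field is a function).
local function check_table_library(t)
  if type(t) ~= "table" then
    return false, "`table` is a " .. type(t) .. ", not a table"
  end
  if type(t.insert) ~= "function" then
    return false, "`table.insert` is " .. type(t.insert)
  end
  return true, "ok"
end

-- Call this just before the failing torch.save line.
print(check_table_library(table))
```

If the check fails only when the script is run with th and not in an interactive session, that would point at something in the script's own globals rather than at the saved model.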

garzy avatar Oct 07 '20 08:10 garzy

Could you print the complete log output?

aikupoker avatar Oct 07 '20 08:10 aikupoker

Selecting best model with less Validation Huber Loss ...
best epoch: 201
best loss: 0.076074071484905
best model path ../Data/Models/NoLimit/river//epoch_201_gpu.info
saving final model
/home/torch/install/bin/lua: /home/torch/install/share/lua/5.2/torch/File.lua:136: attempt to call field 'insert' (a nil value)
stack traceback:
        /home/torch/install/share/lua/5.2/torch/File.lua:136: in function 'writeObject'
        /home/torch/install/share/lua/5.2/torch/File.lua:388: in function 'save'
        Training/pickup_best_model.lua:83: in function 'select_best_model'
        Training/pickup_best_model.lua:97: in main chunk
        [C]: in function 'dofile'
        /home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?
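
The log suggests the script scans the per-epoch .info files and keeps the epoch with the lowest validation loss. A minimal sketch of that selection in plain Lua, assuming the losses have already been read into a list. The `select_best_epoch` name, the record layout, and the sample values are hypothetical; the real script loads its data through torch.

```lua
-- Hypothetical helper: return the epoch with the lowest validation loss.
-- Each record is assumed to look like { epoch = <number>, valid_loss = <number> }.
local function select_best_epoch(records)
  local best_epoch, best_loss = nil, math.huge
  for _, record in ipairs(records) do
    if record.valid_loss < best_loss then
      best_epoch, best_loss = record.epoch, record.valid_loss
    end
  end
  return best_epoch, best_loss
end

-- Made-up sample values, for illustration only.
local best_epoch, best_loss = select_best_epoch({
  { epoch = 200, valid_loss = 0.0790 },
  { epoch = 201, valid_loss = 0.0761 },
  { epoch = 202, valid_loss = 0.0775 },
})
print("best epoch: " .. best_epoch)
print("best loss: " .. best_loss)
```

An empty list yields a nil epoch, so a caller should check for nil before building a file name from the result.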

garzy avatar Oct 07 '20 08:10 garzy

Please update your local "deeper-stacker" repo again (master branch) and retry.

Thanks!

aikupoker avatar Oct 07 '20 08:10 aikupoker

Same problem :(

best epoch: 201
best loss: 0.076074071484905
best model info path ../Data/Models/NoLimit/river/epoch_201_gpu.info
saving final model
/home/torch/install/bin/lua: /home/torch/install/share/lua/5.2/torch/File.lua:136: attempt to call field 'insert' (a nil value)
stack traceback:
        /home/torch/install/share/lua/5.2/torch/File.lua:136: in function 'writeObject'
        /home/torch/install/share/lua/5.2/torch/File.lua:388: in function 'save'
        Training/pickup_best_model.lua:85: in function 'select_best_model'
        Training/pickup_best_model.lua:104: in main chunk
        [C]: in function 'dofile'
        /home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?

garzy avatar Oct 07 '20 08:10 garzy

Could you run ls -lah ../Data/Models/NoLimit/river/ and post the output?

aikupoker avatar Oct 07 '20 09:10 aikupoker

...
-rw-rw-r-- 1 kml kml  119 oct  7 01:36 epoch_86_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:36 epoch_86_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:37 epoch_87_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:37 epoch_87_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:38 epoch_88_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:38 epoch_88_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:38 epoch_89_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:38 epoch_89_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 00:47 epoch_8_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 00:47 epoch_8_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:39 epoch_90_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:39 epoch_90_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:40 epoch_91_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:40 epoch_91_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:40 epoch_92_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:40 epoch_92_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:41 epoch_93_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:41 epoch_93_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:42 epoch_94_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:42 epoch_94_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:42 epoch_95_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:42 epoch_95_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:43 epoch_96_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:43 epoch_96_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:43 epoch_97_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:43 epoch_97_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:44 epoch_98_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:44 epoch_98_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 01:45 epoch_99_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 01:45 epoch_99_gpu.model
-rw-rw-r-- 1 kml kml  119 oct  7 00:47 epoch_9_gpu.info
-rw-rw-r-- 1 kml kml 112M oct  7 00:47 epoch_9_gpu.model
-rw-rw-r-- 1 kml kml    0 oct  7 10:56 final__gpu.model
-rw-rw-r-- 1 kml kml    8 oct  6 19:47 .gitkeep

garzy avatar Oct 07 '20 09:10 garzy

There are two underscores in final__gpu.model... could that be the cause?

garzy avatar Oct 07 '20 09:10 garzy

I fixed two typos in master branch. Try again.

aikupoker avatar Oct 07 '20 09:10 aikupoker

same problem

saving final model
/home/torch/install/bin/lua: /home/torch/install/share/lua/5.2/torch/File.lua:136: attempt to call field 'insert' (a nil value)
stack traceback:
        /home/torch/install/share/lua/5.2/torch/File.lua:136: in function 'writeObject'
        /home/torch/install/share/lua/5.2/torch/File.lua:388: in function 'save'
        Training/pickup_best_model.lua:85: in function 'select_best_model'
        Training/pickup_best_model.lua:104: in main chunk
        [C]: in function 'dofile'
        /home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?

Don't worry. My trained models may be corrupted: I generated them on Kubuntu 18.04, where I ended up getting segmentation faults (core dumped). While trying to fix that, I found that I need Kubuntu 16 instead, but on the fresh Kubuntu 16 install I went straight to step 4 (th Training/main_train.lua 4).

I'm going to redo everything from step 3 (th Training/raw_converter.lua 4).

garzy avatar Oct 07 '20 10:10 garzy

After repeating the steps, I get the same error:

/deeper-stacker/Source$ th Training/pickup_best_model.lua 4
Selecting best model with less Validation Huber Loss ...
best epoch: 204 of total: 350 epochs
best loss: 0.074449650388494
best model info path ../Data/Models/NoLimit/river/epoch_204_gpu.info
saving final model
/home/torch/install/bin/lua: /home/torch/install/share/lua/5.2/torch/File.lua:136: attempt to call field 'insert' (a nil value)
stack traceback:
        /home/torch/install/share/lua/5.2/torch/File.lua:136: in function 'writeObject'
        /home/torch/install/share/lua/5.2/torch/File.lua:388: in function 'save'
        Training/pickup_best_model.lua:85: in function 'select_best_model'
        Training/pickup_best_model.lua:104: in main chunk
        [C]: in function 'dofile'
        /home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?

garzy avatar Oct 07 '20 17:10 garzy

I can continue without running the pickup_best_model script by doing this instead:

cp epoch_204_gpu.info final_gpu.info
cp epoch_204_gpu.model final_gpu.model
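
The same manual workaround could be scripted. A minimal sketch in plain Lua that copies a file byte for byte; `copy_file` is a hypothetical helper, and it only duplicates the files, it does not check that they are loadable models.

```lua
-- Hypothetical helper: copy `src` to `dst` in binary mode.
-- Returns true on success, or nil plus an error message on failure.
local function copy_file(src, dst)
  local input, open_err = io.open(src, "rb")
  if not input then return nil, open_err end
  local output, create_err = io.open(dst, "wb")
  if not output then
    input:close()
    return nil, create_err
  end
  -- Copy in 1 MiB chunks, since each model file in the listing is about 112M.
  while true do
    local chunk = input:read(1024 * 1024)
    if not chunk then break end
    output:write(chunk)
  end
  input:close()
  output:close()
  return true
end

-- Example usage (file names taken from the commands above):
-- copy_file("epoch_204_gpu.info", "final_gpu.info")
-- copy_file("epoch_204_gpu.model", "final_gpu.model")
```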

When I execute torch.load('final_gpu.info'), the model seems to load fine.

Then, I continue with turn generation:

kml@kubuntu:~/deeper-stacker$ cd Source && th DataGeneration/main_data_generation.lua 3
Generating data ...
6sAh9s5c 1 292NN information:
learning_rate   0.0001
valid_loss      0.074449650388494
gpu     true
epoch   204
NN architecture:
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> output]
  (1): nn.ConcatTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
      |      (1): nn.Linear(1009 -> 500)
      |      (2): nn.BatchNormalization (2D) (500)
      |      (3): nn.PReLU
      |      (4): nn.Linear(500 -> 500)
      |      (5): nn.BatchNormalization (2D) (500)
      |      (6): nn.PReLU
      |      (7): nn.Linear(500 -> 500)
      |      (8): nn.BatchNormalization (2D) (500)
      |      (9): nn.PReLU
      |      (10): nn.Linear(500 -> 1008)
      |    }
       `-> (2): nn.Sequential {
             [input -> (1) -> output]
             (1): nn.Narrow
           }
       ... -> output
  }
  (2): nn.ConcatTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> output]
      |      (1): nn.SelectTable(1)
      |    }
       `-> (2): nn.Sequential {
             [input -> (1) -> (2) -> (3) -> output]
             (1): nn.DotProduct
             (2): nn.Replicate
             (3): nn.MulConstant
           }
       ... -> output
  }
  (3): nn.CAddTable
}
nextround init_bucket time: 1.1490240097046
    avgTime: 123.4568271637
AdAs8s8h 2 979nextround init_bucket time: 0.58787417411804
    avgTime: 73.452112078667
4hTdAd7h 3 1712nextround init_bucket time: 1.2796399593353
    avgTime: 57.244448343913
Th2d3s5c 4 14861nextround init_bucket time: 0.56205201148987
    avgTime: 44.798284769058
2s2hTsJc 5 100nextround init_bucket time: 0.64476418495178

garzy avatar Oct 07 '20 18:10 garzy

This error is weird. When I run th Training/pickup_best_model.lua 4 directly, the error occurs. When I debug pickup_best_model.lua in VS Code, no error occurs and the resulting final_gpu.model works fine. When I enter the torch environment and execute torch.save(final_model_file_name, best_model) manually, no error occurs either. My environment is Windows 10, LuaJIT, cutorch.

yffbit avatar Feb 09 '21 12:02 yffbit