
Problem with using pretrained NN

Open nguyensu opened this issue 4 years ago • 8 comments

Hi Silviu:

Thanks for this great package. I have been using it for my research and have run into a problem with the pretrained NNs. The error is as follows:

RuntimeError: Error(s) in loading state_dict for SimpleNet: size mismatch for linear1.weight: copying a param with shape torch.Size([128, 4]) from checkpoint, the shape in current model is torch.Size([128, 3]).

Note that my dataset has 4 independent variables. From the log, it seems AI-Feynman reduced the number of variables from 4 to 3 in a previous step, which may be causing this issue. I just want to check what the best way to fix this is.

Thanks

Su

nguyensu avatar Jul 17 '20 02:07 nguyensu

Dear Su,

Thank you for your interest in our code. Could you please show me the output of the code from a while before this error message? It might help me understand what is going on.

SJ001 avatar Jul 17 '20 17:07 SJ001

Hi,

I have also been having the same issue with my own data: RuntimeError: Error(s) in loading state_dict for SimpleNet: size mismatch for linear1.weight: copying a param with shape torch.Size([128, 9]) from checkpoint, the shape in current model is torch.Size([128, 8]). I have 10 independent variables in my case.

This is my entire error message:

NN already trained

NN loss: tensor(0.4381, grad_fn=<DivBackward0>)

Checking for symmetry Data_Values.txt_train-translated_divide NN already trained

NN loss: tensor(nan, grad_fn=<DivBackward0>)

Checking for symmetry Data_Values.txt_train-translated_divide-translated_plus Found pretrained NN


RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>()
      1 from S_run_aifeynman import run_aifeynman
      2
----> 3 run_aifeynman("/content/AI-Feynman/neural_data/","Data_Values.txt",30,"19ops.txt", polyfit_deg=3)

/content/AI-Feynman/Code/S_run_aifeynman.py in run_aifeynman(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, vars_name, test_percentage)
    163     PA = ParetoSet()
    164     # Run the code on the train data
--> 165     PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
    166     PA_list = PA.get_pareto_points()
    167

/content/AI-Feynman/Code/S_run_aifeynman.py in run_AI_all(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, PA)
    110         new_pathdir, new_filename = do_translational_symmetry_divide(pathdir,filename,symmetry_divide_result[1],symmetry_divide_result[2])
    111         PA1_ = ParetoSet()
--> 112         PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
    113         PA = add_sym_on_pareto(pathdir,filename,PA1,symmetry_divide_result[1],symmetry_divide_result[2],PA,"/")
    114         return PA

/content/AI-Feynman/Code/S_run_aifeynman.py in run_AI_all(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, PA)
     89         new_pathdir, new_filename = do_translational_symmetry_plus(pathdir,filename,symmetry_plus_result[1],symmetry_plus_result[2])
     90         PA1_ = ParetoSet()
---> 91         PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
     92         PA = add_sym_on_pareto(pathdir,filename,PA1,symmetry_plus_result[1],symmetry_plus_result[2],PA,"+")
     93         return PA

/content/AI-Feynman/Code/S_run_aifeynman.py in run_AI_all(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, PA)
     61     elif path.exists("results/NN_trained_models/models/" + filename + "_pretrained.h5"):
     62         print("Found pretrained NN \n")
---> 63         NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
     64         print("NN loss after training: ", NN_eval(pathdir,filename), "\n")
     65     else:

/content/AI-Feynman/Code/S_NN_train.py in NN_train(pathdir, filename, epochs, lrs, N_red_lr, pretrained_path)
    114
    115     if pretrained_path!="":
--> 116         model_feynman.load_state_dict(torch.load(pretrained_path))
    117
    118     check_es_loss = 10000

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    845         if len(error_msgs) > 0:
    846             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847                 self.__class__.__name__, "\n\t".join(error_msgs)))
    848         return _IncompatibleKeys(missing_keys, unexpected_keys)
    849

RuntimeError: Error(s) in loading state_dict for SimpleNet: size mismatch for linear1.weight: copying a param with shape torch.Size([128, 9]) from checkpoint, the shape in current model is torch.Size([128, 8]).

MatthewChen37 avatar Jul 17 '20 22:07 MatthewChen37

Hi Silviu:

My problem is similar to Matthew:

....

Checking for brute force +

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_sqrt/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_sqrt/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 96.744442555828869 at j= 258
Searching for best fit...
666.000000000000 93.602849814816 P 1 763.2601 27.0269 60.2462 1878.5495
666.000000000000 93.744442555829 c 4 733.8841 28.9702 62.1894 1878.7589
666.000000000000 36.329212687856 aR 22 536.4852 31.0740 64.2933 1866.9244
666.000000000000 33.498889352461 bR 23 495.7903 31.0285 64.2478 1863.3896
666.000000000000 33.506795540779 b<R 255 495.7848 34.4993 67.7186 1866.8599
666.000000000000 30.498889352461 cbR+ 504 465.9682 35.3932 68.6124 1865.0132
666.000000000000 72.136995216099 caL* 692 445.3115 35.7402 68.9595 1863.4018
666.000000000000 71.862293635523 cbL* 696 438.0313 35.7342 68.9534 1862.6581
666.000000000000 7.301723455837 bb+R 1271 397.2188 36.5494 69.7687 1859.0650
666.000000000000 -10.338765766932 aPR 1278 382.1004 36.5212 69.7405 1857.3026
666.000000000000 -15.355383331861 bPR 1279 325.5215 36.3115 69.5308 1849.9926
636.385378826349 -31.965972578496 bP>R 13731 286.8809 39.6685 72.8878 1847.6521
633.635321990147 33.420396146597 acLR 13854 324.9232 39.6752 72.8944 1853.3458
546.948122255043 30.453801024219 bcLR 13855 263.2369 39.4630 72.6823 1843.7409
511.293663817382 -58.281956708123 Pba+R 87277 190.8319 42.0210 75.2402 1831.7213
509.110778573857 62.003271579413 bRcRL 355475 270.4776 44.0409 77.2601 1849.6602
498.535075165671 27.453801024219 cbcLR+ 969012 240.0173 45.4573 78.6766 1845.6561
441.063158810445 7.053589317205 abcL*+R 1079602 198.5730 45.4365 78.6558 1837.1642
217.569639622967 15.392387614913 cba**RR 1364564 121.4374 44.7550 77.9743 1815.0671
217.565900310493 15.397960246034 cba<**RR 16990232 121.4347 48.3932 81.6124 1818.7043
217.565881427321 15.397472595089 cab<RR 16990244 121.4346 48.3932 81.6124 1818.7043
217.563899354636 15.398533973346 cbaR<R 21129688 121.4324 48.7077 81.9270 1819.0180
Checking for brute force *

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_sqrt/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_sqrt/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 127750.00000000000 at j= 1
Searching for best fit...
666.000000000000 100.039859017533 P 1 566.0325 26.7770 59.9963 1864.9109
666.000000000000 0.002609252760 a 2 609.7040 27.6532 60.8724 1869.3016
666.000000000000 17.460249716745 c 4 382.9163 28.2411 61.4604 1849.0802
666.000000000000 16.541289205337 c> 8 385.6423 29.2260 62.4452 1850.4039
666.000000000000 14.865696201390 cP+ 44 393.1367 31.6787 64.8980 1853.7414
527.806625474009 0.002561405826 ba+ 47 302.8501 31.2081 64.4274 1841.9329
527.804082835278 0.002561426702 ba<+ 451 302.8485 34.4705 67.6898 1845.1950
527.744822619888 0.002561030125 cba++ 5100 302.8294 37.9696 71.1889 1848.6914
527.286122070359 2.449489742783 baRR 20071 174.6039 39.9449 73.1642 1825.5463
527.107880112961 0.002560416004 babR++ 64763 302.3497 41.6345 74.8538 1852.2857
496.799142612467 0.002554625922 abPL+ 65502 272.1302 41.5654 74.7847 1847.4979
420.325290327500 0.000603729143 ba+cR* 254043 182.1380 43.2797 76.4990 1831.1354
403.349065384127 2.148129696752 cbaRR+ 1151316 163.9800 45.4004 78.6197 1828.5243
329.359877029521 0.063834549508 acbR*R 1264230 121.0946 45.2430 78.4623 1814.8279
0.000000000000 1.189207115003 cba**RR 1364564 0.0000 20.3800 28.2089 220.9229
All done: results in results.dat
Checking polyfit

Complexity RMSE Expression
[0.0, 33.72515523232843, -1.50374224914926e-759802]
[15.509775004326936, 29.04603252368458, 'asin(0.000000000012*(x1*exp(exp(exp(sin(log(x1)))))))']
[18.509775004326936, 29.044458005885136, 'asin(0.000000000091*(exp((cos((x1+1)))*(-1)))*(-1))']
[21.094737505048094, 1.7551041143752824, '0.000000000000+sqrt((x2*(x1*(x0+x0))))']
Checking for brute force +

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_squared/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_squared/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 127750.00000000000 at j= 1
Searching for best fit...
0.000976562500 0.000001907349 cbaa+** 864596 0.0001 26.3313 59.5506 1193.2193
Checking for brute force *

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_squared/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_squared/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 4596430400000.0000 at j= 660
Searching for best fit...
0.000976562500 2.000000000000 cba** 5420 0.0001 19.0137 52.2330 1185.9016
Checking polyfit

Complexity RMSE Expression
[0.0, 33.72515523232843, -1.50374224914926e-759802]
[15.509775004326936, 29.04603252368458, 'asin(0.000000000012*(x1*exp(exp(exp(sin(log(x1)))))))']
[18.509775004326936, 29.044458005885136, 'asin(0.000000000091*(exp((cos((x1+1)))*(-1)))*(-1))']
[21.094737505048094, 1.7551041143752824, '0.000000000000+sqrt((x2*(x1*(x0+x0))))']
Checking for brute force +

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_tan/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_tan/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 1.5030748346316705E-003 at j= 942
Searching for best fit...
666.000000000000 -3.143095815847 P 1 166.1562 28.9500 62.1693 1808.9909
666.000000000000 37.998496925165 c~ 16 166.7884 32.9472 66.1665 1813.1642
666.000000000000 -489.388875204521 bR 23 252.7203 33.4397 66.6590 1832.6463
666.000000000000 -489.387853517918 b<R 255 252.7214 36.9105 70.1298 1836.1173
666.000000000000 666.000000000000 cc~* 620 1113.5803 37.9645 71.1838 1905.0587
666.000000000000 666.000000000000 cc~<* 7584 1130.9781 41.5738 74.7931 1909.3786
666.000000000000 -666.000000000000 bP>E/ 11139 2138.6914 41.9623 75.1816 1938.9995
666.000000000000 -427.957730731187 cccS** 67732 1197.7314 44.4707 77.6900 1915.1537
666.000000000000 -1.021180233438 ca+RS\ 296860 51.8851 44.7189 77.9381 1774.0710
Checking for brute force *

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_tan/eoq_data.txt_train-translated_plus...
/bin/cp -p results/mystery_world_tan/eoq_data.txt_train-translated_plus mystery.dat
Number of variables..... 3
Functions used.......... +-/><~\RPSCLE
Arity 0 : Pabc
Arity 1 : ><~\RSCLE
Arity 2 : +-/
Loading mystery data....
1000 rows read from file mystery.dat
Number of examples...... 1000
Mystery data has largest magnitude 127750.00000000000 at j= 1
Searching for best fit...
666.000000000000 -0.022601610502 P 1 166.1535 28.9500 62.1692 1808.9902
666.000000000000 -0.000000589498 a 2 166.1523 29.9499 63.1692 1809.9899
666.000000000000 -0.000031557802 b 3 166.0455 30.5337 63.7530 1810.5455
666.000000000000 -0.000031571834 b< 11 166.0455 32.4082 65.6275 1812.4200
666.000000000000 -0.000000000262 ba* 63 165.9667 34.9254 68.1447 1814.9162
666.000000000000 -0.000000014026 bb* 67 425.0213 34.9417 68.1610 1857.9057
666.000000000000 -0.000000014032 bb<* 599 425.1815 38.1020 71.3213 1861.0833
666.000000000000 -0.000000000000 bba** 5419 388.2227 41.2396 74.4589 1860.1119
666.000000000000 -0.000000000779 cbb** 5436 790.8292 41.1509 74.3702 1892.5766
666.000000000000 666.000000000000 bS>S< 42783 748.3024 43.9908 77.2101 1893.0313
666.000000000000 666.000000000000 bb>*C< 212655 1174.2245 46.0924 79.3117 1915.9000
Checking polyfit

Complexity RMSE Expression
[0.0, 30.05749571799299, 'atan(-0.000000000000*(x1*(x1*x0)))']
[15.509775004326936, 29.04603252368458, 'asin(0.000000000012*(x1*exp(exp(exp(sin(log(x1)))))))']
[18.509775004326936, 29.044458005885136, 'asin(0.000000000091*(exp((cos((x1+1)))*(-1)))*(-1))']
[21.094737505048094, 1.7551041143752824, '0.000000000000+sqrt((x2*(x1*(x0+x0))))']
Checking for symmetry eoq_data.txt_train-translated_plus
Found pretrained NN

Traceback (most recent call last):
  File "inventory/inventory_learn.py", line 70, in <module>
    # vars_name=["D","K","H","Q"]
  File "AI-Feynman/Code/S_run_aifeynman.py", line 165, in run_aifeynman
    PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
  File "AI-Feynman/Code/S_run_aifeynman.py", line 91, in run_AI_all
    PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
  File "AI-Feynman/Code/S_run_aifeynman.py", line 63, in run_AI_all
    NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
  File "AI-Feynman/Code/S_NN_train.py", line 116, in NN_train
    model_feynman.load_state_dict(torch.load(pretrained_path))
  File "/usr/bin/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SimpleNet:
	size mismatch for linear1.weight: copying a param with shape torch.Size([128, 4]) from checkpoint, the shape in current model is torch.Size([128, 3]).

Process finished with exit code 1

nguyensu avatar Jul 20 '20 05:07 nguyensu

Hey Silviu, I'm facing the exact same problem when running on my own data.

  File "birdseyefeyn.py", line 6, in <module>
    run_aifeynman("example_data/",'debuglight3_hitincremented_only_small.txt',30,"14ops.txt", polyfit_deg=3, NN_epochs=400)
  File "/Volumes/Transcend/ai_feynman/AI-Feynman/feynman/S_run_aifeynman.py", line 169, in run_aifeynman
    PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
  File "/Volumes/Transcend/ai_feynman/AI-Feynman/feynman/S_run_aifeynman.py", line 94, in run_AI_all
    PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
  File "/Volumes/Transcend/ai_feynman/AI-Feynman/feynman/S_run_aifeynman.py", line 66, in run_AI_all
    NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
  File "/Volumes/Transcend/ai_feynman/AI-Feynman/feynman/S_NN_train.py", line 133, in NN_train
    model_feynman.load_state_dict(torch.load(pretrained_path))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SimpleNet:
	size mismatch for linear1.weight: copying a param with shape torch.Size([128, 4]) from checkpoint, the shape in current model is torch.Size([128, 3]).

My dataset has exactly the same number of variables as example1, which is what is so confusing to me.

I also tried editing linear1.weight, subtracting 1 to get 3 (instead of 4), and discovered that it's actually already the same shape, which makes this error even more confusing.

youssefavx avatar Feb 15 '21 18:02 youssefavx

example1.txt works fine; it's only when I use my own dataset with the same number of variables that it gives that error.

youssefavx avatar Feb 15 '21 18:02 youssefavx

Just to be even clearer, here's a sample of what my dataset looks like:

6382179.0000000000000000 1.0000000000000000 1000.0000000000000000 1.0000000000000000 26.0000000000000000 
6382179.0000000000000000 1.0000000000000000 1000.0000000000000000 2.0000000000000000 30.0000000000000000 
6382179.0000000000000000 1.0000000000000000 1000.0000000000000000 3.0000000000000000 46.0000000000000000 
6382179.0000000000000000 1.0000000000000000 1000.0000000000000000 4.0000000000000000 77.0000000000000000

Even though these are supposed to be integers, I tried hard to match everything to example1 as closely as possible.

The last thing I'm thinking of trying is to normalize all these numbers to something under 10. That's not something I want to do, but it might be necessary to actually get an answer.
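For what it's worth, the rescaling idea could be sketched like this. It's purely illustrative, using the sample rows above; the per-column scale factors would have to be divided back out of any formula AI-Feynman recovers.

```python
import numpy as np

# Sample rows from the dataset above (last column is the target).
data = np.array([
    [6382179.0, 1.0, 1000.0, 1.0, 26.0],
    [6382179.0, 1.0, 1000.0, 2.0, 30.0],
    [6382179.0, 1.0, 1000.0, 3.0, 46.0],
    [6382179.0, 1.0, 1000.0, 4.0, 77.0],
])

# Divide each column by the power of ten of its largest magnitude,
# so every value ends up with magnitude under 10.
scales = 10.0 ** np.floor(np.log10(np.abs(data).max(axis=0)))
normalized = data / scales
# np.savetxt("normalized.txt", normalized, fmt="%.16f") would write it back out
```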

I really tried hard to find this bug and I still can't find it.

  1. I checked model_feynman before the state_dict is loaded, and linear1's input size is 3; when I checked the model being loaded, it's also 3, and yet I still get this same error:

RuntimeError: Error(s) in loading state_dict for SimpleNet:
	size mismatch for linear1.weight: copying a param with shape torch.Size([128, 4]) from checkpoint, the shape in current model is torch.Size([128, 3]).

  2. I tried subtracting and adding 1 to the input size, still the same error:

model_feynman.linear1.in_features = (model_feynman.linear1.in_features - 1)

  3. I tried loading in the model using torch.load, but of course that didn't work because it loads an OrderedDict.

  4. I tried doing:

model_feynman = SimpleNet(n_variables)

again before loading the model, and still the same error.

And somehow example1.txt still works.

  5. I tried renaming my file to example1... still not working. I thought maybe some default settings were set for that filename.
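In case it saves anyone else time: the in_features edit above can't work, because nn.Linear's in_features attribute is just bookkeeping; assigning to it doesn't resize the weight tensor, so the size mismatch stays. A quick sanity check in plain PyTorch, nothing AI-Feynman-specific:

```python
import torch.nn as nn

# nn.Linear(in_features, out_features): the weight has shape (out, in).
layer = nn.Linear(3, 128)
layer.in_features = 4              # edits the attribute only...
print(tuple(layer.weight.shape))   # ...the weight is still (128, 3)

# The only way to change the input width is to rebuild the layer,
# after which the pretrained checkpoint no longer fits -- the bug here.
layer = nn.Linear(4, 128)
print(tuple(layer.weight.shape))   # now (128, 4)
```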

If anybody solves this problem, please help!

youssefavx avatar Feb 16 '21 10:02 youssefavx

For the first variable, I'm trying to represent a string numerically by converting it from a string to binary to an int. Not sure if that's the best way to do it for a problem like this, but anyway, I'm experimenting.

youssefavx avatar Feb 16 '21 10:02 youssefavx

I get the same error with the current branch:

Error(s) in loading state_dict for SimpleNet:
	size mismatch for linear1.weight: copying a param with shape torch.Size([128, 3]) from checkpoint, the shape in current model is torch.Size([128, 2]).

This branch does not get the error running on Google Colab with the TPU:

!git clone https://github.com/SJ001/AI-Feynman.git
!cd /content/AI-Feynman && git reset --hard 28edde1a36a166a081de84999ab4fd40071957db

dbl001 avatar Oct 11 '22 16:10 dbl001