
Runtime error: size mismatch for linear1.weight: copying a param with shape

Open moridinamael opened this issue 4 years ago • 7 comments

I have a dataset that runs fine for about an hour of steady processing with apparent improvement before giving this error. See the attached output log. It seems to have something to do with the dimension changing, despite there being no issues with the dimensions in the dataset. Console output: error.txt

moridinamael avatar Jul 29 '20 23:07 moridinamael

I have reduced the variable count for subsequent test runs and still encounter exactly the same error (size mismatch for linear.weight ... ), with the network's variable count eventually hitting a mismatch. In these cases the runs failed after almost 12 hours of computation.

moridinamael avatar Jul 31 '20 15:07 moridinamael


I'm having the same issue. The developers haven't responded in two weeks, so perhaps it is time we take this issue into our own hands. While we wait for this bug to be fixed, we could try to fix it ourselves. I can start an email thread where we can discuss AI-Feynman and speed up a fix. You can email me at [email protected], or post your email here so I can start a discussion thread with the 4 people who have this same issue. Thanks.


RodrigoSandon avatar Jul 31 '20 16:07 RodrigoSandon

@RodrigoSandon I sent you an email.

moridinamael avatar Jul 31 '20 20:07 moridinamael

Hey guys I'm getting this same error here:

Checking polyfit 

Complexity  RMSE  Expression
[0.0, 28.98901941844433, 'asin(0)']
Checking for symmetry 
 Dataset with clusters n.txt_train-translated_plus
Found pretrained NN 

RuntimeError                              Traceback (most recent call last)
<ipython-input-6-7c9e9ea3efd3> in <module>()
      1 from S_run_aifeynman import run_aifeynman
      2 # Run example 1 as the regression dataset
----> 3 run_aifeynman("/content/AI-Feynman/example_data/",'Dataset with clusters n.txt',30,"14ops.txt", polyfit_deg=3, NN_epochs=400)

4 frames
/content/AI-Feynman/Code/ in run_aifeynman(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, vars_name, test_percentage)
    163     PA = ParetoSet()
    164     # Run the code on the train data
--> 165     PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
    166     PA_list = PA.get_pareto_points()

/content/AI-Feynman/Code/ in run_AI_all(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, PA)
     89         new_pathdir, new_filename = do_translational_symmetry_plus(pathdir,filename,symmetry_plus_result[1],symmetry_plus_result[2])
     90         PA1_ = ParetoSet()
---> 91         PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
     92         PA = add_sym_on_pareto(pathdir,filename,PA1,symmetry_plus_result[1],symmetry_plus_result[2],PA,"+")
     93         return PA

/content/AI-Feynman/Code/ in run_AI_all(pathdir, filename, BF_try_time, BF_ops_file_type, polyfit_deg, NN_epochs, PA)
     61     elif path.exists("results/NN_trained_models/models/" + filename + "_pretrained.h5"):
     62         print("Found pretrained NN \n")
---> 63         NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
     64         print("NN loss after training: ", NN_eval(pathdir,filename), "\n")
     65     else:

/content/AI-Feynman/Code/ in NN_train(pathdir, filename, epochs, lrs, N_red_lr, pretrained_path)
    115         if pretrained_path!="":
--> 116             model_feynman.load_state_dict(torch.load(pretrained_path))
    118         check_es_loss = 10000

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in load_state_dict(self, state_dict, strict)
   1050         if len(error_msgs) > 0:
   1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1053         return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for SimpleNet:
	size mismatch for linear1.weight: copying a param with shape torch.Size([128, 10]) from checkpoint, the shape in current model is torch.Size([128, 9]).

Were you able to solve this?

I'm not loading from a pre-trained model; this just happens after running this notebook on my own data (after substituting 'example1.txt' with my own data).

youssefavx avatar Jan 26 '21 15:01 youssefavx

I just encountered this error too. It looks like it has gone unaddressed for a while. Have you guys made any progress? Thanks.

mbadikyan avatar Apr 07 '21 05:04 mbadikyan

Hey guys, is anyone interested in having this problem debugged for $? Would you pay to have this solved? If so, I could try my hand at it. It doesn't have to be big, e.g. $1 to $5 through PayPal. If so, contact me at [email protected] or just let me know here.

youssefavx avatar May 07 '21 13:05 youssefavx

There appears to be an issue where the newly generated training data file has one fewer variable than the saved model that gets loaded when the 'pretrained_path' code is executed. Synopsis:

i. if pretrained_path!="": is where the exception is generated, while loading the state_dict and running model.eval().
ii. remove_input_neuron(model,n_variables,j,ct_median,"results/NN_trained_models/models/"+filename + "-translated_plus_pretrained.h5") saves the state_dict and model with the original number of factors.
iii. n_variables is loaded with the reduced number of factors.
iv. elif path.exists("results/NN_trained_models/models/" + filename + "_pretrained.h5"): loads the pretrained model with the original number of factors.
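The mismatch described above can be reproduced in isolation. This is a minimal sketch, assuming PyTorch is installed; TinyNet is a hypothetical stand-in, not the SimpleNet class from the repository, and the sizes 10 and 9 are taken from the error message earlier in this thread.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in network; only the first layer matters here."""

    def __init__(self, n_inputs):
        super().__init__()
        self.linear1 = nn.Linear(n_inputs, 128)

    def forward(self, x):
        return self.linear1(x)


# State dict saved while the model still had 10 input variables.
checkpoint = TinyNet(10).state_dict()

# Model rebuilt for a data file that has only 9 input variables.
model = TinyNet(9)

try:
    model.load_state_dict(checkpoint)
    error_text = ""
except RuntimeError as err:
    # load_state_dict refuses to copy a [128, 10] weight into a [128, 9] one.
    error_text = str(err)

print(error_text)
```

The printed message should name linear1.weight and both shapes, which matches the failure reported in this issue.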

If you comment out this line, you can circumvent the exception.

   elif path.exists("results/NN_trained_models/models/" + filename + "_pretrained.h5"):

When S_symmetry invokes remove_input_neuron(), the model has 'n' factors; however, the new 'data_translated' file that gets saved has 'n-1' factors. The discrepancy between the number of factors in the .h5 file and in the data_translated file causes the exception in S_NN_train when the 'if pretrained_path' code is executed.
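A gentler alternative to commenting out the branch might be to check the checkpoint's input width before loading it. The following is a rough sketch, assuming PyTorch; checkpoint_matches is a hypothetical helper that is not part of AI-Feynman, and the key name linear1.weight is taken from the error message in this thread.

```python
import torch


def checkpoint_matches(state_dict, n_variables, first_layer="linear1.weight"):
    """Return True if the saved first layer expects n_variables inputs.

    Hypothetical helper: a Linear weight has shape [out_features, in_features],
    so the second dimension is the number of input variables.
    """
    weight = state_dict.get(first_layer)
    return weight is not None and weight.shape[1] == n_variables


# Stand-in for a checkpoint saved with the original 4 factors.
saved = {"linear1.weight": torch.zeros(128, 4)}

print(checkpoint_matches(saved, 4))  # width agrees with the original data file
print(checkpoint_matches(saved, 3))  # width disagrees with the translated file
```

In the real code the state dict would presumably come from torch.load(pretrained_path), and when the check fails the network could be trained from scratch instead of loading the stale checkpoint.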


Original training file (4 factors):


New training file (3 factors):


dbl001 avatar Nov 27 '22 18:11 dbl001