
train parameters size mismatch

Open kebijuelun opened this issue 2 years ago • 37 comments

  • I followed the testing steps but hit the following error. It seems that the saved model parameters do not match the model definition.
/data/github_code/ai-imu-dr/src/main_kitti.py in launch(args)
     29 
     30     if args.test_filter:
---> 31         test_filter(args, dataset)
     32 
     33     if args.results_filter:

/data/github_code/ai-imu-dr/src/main_kitti.py in test_filter(args, dataset)
    427     from IPython import embed; embed()
    428 
--> 429     torch_iekf.load(args, dataset)
    430     iekf.set_learned_covariance(torch_iekf)
    431 

/data/github_code/ai-imu-dr/src/utils_torch_filter.py in load(self, args, dataset)
    461         if os.path.isfile(path_iekf):
    462             mondict = torch.load(path_iekf)
--> 463             self.load_state_dict(mondict)
    464             cprint("IEKF nets loaded", 'green')
    465         else:

~/miniconda3/envs/dfvo/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    775         if len(error_msgs) > 0:
    776             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 777                                self.__class__.__name__, "\n\t".join(error_msgs)))
    778         return _IncompatibleKeys(missing_keys, unexpected_keys)
    779 

RuntimeError: Error(s) in loading state_dict for TORCHIEKF:
        Unexpected key(s) in state_dict: "mes_net.cov_net.8.weight", "mes_net.cov_net.8.bias", "mes_net.cov_net.12.weight", "mes_net.cov_net.12.bias", "mes_net.cov_net.16.weight", "mes_net.cov_net.16.bias". 
        size mismatch for mes_net.cov_net.4.weight: copying a param with shape torch.Size([64, 32, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5]).
        size mismatch for mes_net.cov_net.4.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
  • torch version
torch                              1.1.0     
torchvision                        0.3.0 

kebijuelun avatar Dec 28 '21 06:12 kebijuelun

Set continue_training to False in class KITTIArgs() of main_kitti.py.

kimms74 avatar Jan 03 '22 02:01 kimms74

make continue_training False in class KITTIArgs() of main_kitti.py

I still get the same parameter mismatch error after setting continue_training to False. Do you have any other ideas about this problem?

hmf21 avatar Jan 10 '22 06:01 hmf21

Same issue here

lumyus avatar Jan 10 '22 09:01 lumyus

I also get something very similar:

RuntimeError: Error(s) in loading state_dict for TORCHIEKF:
        Unexpected key(s) in state_dict: "mes_net.cov_net.8.weight", "mes_net.cov_net.8.bias", "mes_net.cov_net.12.weight", "mes_net.cov_net.12.bias", "mes_net.cov_net.16.weight", "mes_net.cov_net.16.bias".
        size mismatch for mes_net.cov_net.4.weight: copying a param with shape torch.Size([64, 32, 5]) from checkpoint, the shape in current model is torch.Size([32, 32, 5]).
        size mismatch for mes_net.cov_net.4.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).

Part of the problem goes away if you adjust the sizes in mesnet, but so far I have not found the right size adjustments to make the problem go away completely.

=> This happens if path_iekf finds the file ../temp/iekfnets.p. However, if the file is not there, the program carries on and I still get the beautiful plot shown on GitHub, namely the route segment of file 2011_09_30_drive_0028_extract.
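The error reports two separate problems: keys that exist only in the checkpoint, and keys whose shapes differ. A minimal sketch for listing both before calling load_state_dict follows. It works on plain name-to-shape dictionaries so that it runs without PyTorch; `diff_shapes` is a hypothetical helper, not part of the repository, and the demo shapes are copied from the error message (the shapes of the unexpected keys are not reported there, so they are left as None).

```python
def diff_shapes(ckpt_shapes, model_shapes):
    """Compare two {parameter_name: shape} dicts.

    Returns (unexpected, missing, mismatched):
      unexpected -> names found only in the checkpoint
      missing    -> names found only in the model
      mismatched -> {name: (checkpoint_shape, model_shape)}
    """
    unexpected = sorted(set(ckpt_shapes) - set(model_shapes))
    missing = sorted(set(model_shapes) - set(ckpt_shapes))
    mismatched = {
        name: (ckpt_shapes[name], model_shapes[name])
        for name in sorted(set(ckpt_shapes) & set(model_shapes))
        if ckpt_shapes[name] != model_shapes[name]
    }
    return unexpected, missing, mismatched


# Only the keys named in the RuntimeError are listed here.
ckpt = {
    "mes_net.cov_net.4.weight": (64, 32, 5),
    "mes_net.cov_net.4.bias": (64,),
    "mes_net.cov_net.8.weight": None,  # shape not given in the error message
    "mes_net.cov_net.8.bias": None,    # shape not given in the error message
}
model = {
    "mes_net.cov_net.4.weight": (32, 32, 5),
    "mes_net.cov_net.4.bias": (32,),
}

unexpected, missing, mismatched = diff_shapes(ckpt, model)
print("unexpected:", unexpected)
print("missing:", missing)
for name, (in_ckpt, in_model) in mismatched.items():
    print(f"{name}: checkpoint {in_ckpt} vs model {in_model}")
```

With PyTorch available, and assuming the checkpoint is a plain state dict (which the load_state_dict(mondict) call in the traceback suggests), the inputs could be built with `{k: tuple(v.shape) for k, v in torch.load(path_iekf).items()}` and the same comprehension over `torch_iekf.state_dict()`.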

scott81321 avatar Jan 10 '22 11:01 scott81321

Same here. If you set "train_filter = 1" and then go back to "train_filter = 0", the IEKF will be loaded; however, that is not the trained model specified at the URL in the README. Changing the network parameters (shape) did not work for me either. Any idea what's going on here, @mbrossar?

lumyus avatar Jan 10 '22 11:01 lumyus

I am still working with the original default at train_filter = 0 and test_filter=1. I cannot complain because the curves obtained are BEAUTIFUL (Merci Martin) but I realize that some training had to be used to get that BEAUTIFUL curve. The code reads in ../temp/normalize_factors.p and needs them

I suppose an adjacent question is: how to get the curves with pure IEKF and no help from the AI or CNN part?

scott81321 avatar Jan 10 '22 12:01 scott81321

Following @scott81321 , I deleted ../temp/iekfnets.p and also got some curves, which seem to be generated by the mesnet with randomly initialized parameters. According to the paper, mesnet is composed of two Conv layers, but iekfnets.p contains a model with five layers. Is there anything wrong with the implementation?

hmf21 avatar Jan 10 '22 12:01 hmf21

Hello @hmf17 can you read the contents of iekfnets.p? What little I know is that they contain the CNN (mes.net). I can get a picture of it using netron but can you give me the python instructions to read the contents?

CNN_model

scott81321 avatar Jan 10 '22 12:01 scott81321

Hi @scott81321 , I use torch.load() to read the contents of iekfnets.p and the result is shown in the picture below. Although the picture is not intuitive, the structure seems to be different from your picture, which only contains two conv layers. How did you get this diagram? It is very beautiful.

image

hmf21 avatar Jan 10 '22 12:01 hmf21

Hello @hmf17 Thank you. To get that picture of the CNN, I use a relatively new tool called netron. You can use it online at https://netron.app/ or download it from GitHub at https://github.com/lutzroeder/netron. You have to create a .pt file inside __init__ in class TORCHIEKF. After the instruction
self.mes_net = MesNet()
save the CNN model with
PATH = "...../CNN_model.pt"
torch.save(self.mes_net, PATH)
Once you have that, load it into netron.

I do see something weird in the picture you just showed me: dimension indices as high as 128? Your picture is beautiful also. I used torch.load() but then followed it with a print statement, which gives too many details. How did you get the tensor dimensions upfront?

scott81321 avatar Jan 10 '22 13:01 scott81321

Hi @scott81321 , thank you for pointing me to this powerful software. I simply use PyCharm to see the details of iekfnets.p; the parameter states are visible in the Variables toolbar. The maximum output feature dimension in this model is 128, which is quite different from the description in the paper. I still have made no progress in running this program. Do you have any good ideas?

hmf21 avatar Jan 10 '22 14:01 hmf21

Oh! just use the code as originally loaded and remove iekfnets.p from the temp sub-directory [just put iekfnets.p elsewhere]. If it cannot find the file, it gives a print statement [look for cprint("IEKF nets NOT loaded", 'yellow') in utils_torch_filter.py] but carries on nonetheless. The original version that you can download only uses normalize_factors.p [make sure train_filter=0]. I got the code working on the test files producing 10 ensembles of graphs. What I would like to know is how to get the results without the training i.e. pure IEKF because ironically, even though I am clearly NOT loading iekfnets.p, the picture I get for 2011_09_30_drive_0028_extract i.e. file position_xy.png looks like the result enhanced with AI (CNN) not the raw IEKF result.
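Because a missing iekfnets.p only produces a warning and the run continues, it is easy to evaluate a network that was never loaded without noticing. A minimal sketch of a stricter check is below; `require_checkpoint` is a hypothetical helper, not something the repository provides, and it assumes that stopping is preferable to continuing when the file is absent.

```python
import os


def require_checkpoint(path):
    """Return `path` if it is an existing file, otherwise raise.

    This replaces the "warn and carry on" behaviour with a hard stop, so a
    test run cannot silently use weights that were never loaded.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"checkpoint not found: {path}; refusing to continue without it"
        )
    return path
```

One possible use would be to call `require_checkpoint(path_iekf)` before `torch.load(path_iekf)` when a trained model is expected, and to keep the original lenient branch for runs that are deliberately made without one.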

Please, can you give me the specific Python command(s) to print out the contents of iekfnets.p ??

scott81321 avatar Jan 10 '22 14:01 scott81321

Hi @scott81321 , I just use some simple commands:
path_iekf = './temp/iekfnets.p'
mondict = torch.load(path_iekf)
Then I can see the content of the loaded model in the Variables toolbar on the right.
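For a plain-text listing instead of the IDE view, a short sketch that prints one line per entry with its shape is shown here. `describe_state_dict` is a hypothetical helper, and `FakeTensor` is only a stand-in so the example runs without PyTorch; the two demo shapes are the checkpoint shapes quoted in the error message.

```python
from collections import namedtuple


def describe_state_dict(state_dict):
    """Return one 'name: shape' string per entry of a state-dict-like mapping.

    Values only need a `.shape` attribute; anything without one is shown
    with an empty shape.
    """
    lines = []
    for name, value in state_dict.items():
        shape = tuple(getattr(value, "shape", ()))
        lines.append(f"{name}: {shape}")
    return lines


# Stand-in for a tensor, used only so this sketch runs without PyTorch.
FakeTensor = namedtuple("FakeTensor", ["shape"])

demo = {
    "mes_net.cov_net.4.weight": FakeTensor((64, 32, 5)),
    "mes_net.cov_net.4.bias": FakeTensor((64,)),
}
for line in describe_state_dict(demo):
    print(line)
```

With the real file, `describe_state_dict(torch.load(path_iekf))` should give the same kind of listing, since a tensor's shape (torch.Size) converts cleanly to a tuple.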

hmf21 avatar Jan 10 '22 15:01 hmf21

Thx. Here is what netron gives for iekfnets.pt (note as a pt file) iekfnets

scott81321 avatar Jan 10 '22 19:01 scott81321

great! @scott81321

hmf21 avatar Jan 11 '22 02:01 hmf21

So did anyone get it to work? I mean actually use your own data to get results? The plots seem to be generated no matter what model is used..

lumyus avatar Jan 11 '22 19:01 lumyus

I got it to work for the datasets downloaded from github. Not on my own data yet. I need to better understand his code. E.g. how to switch on the neural network and not use it i.e. pure IEKF.

scott81321 avatar Jan 11 '22 19:01 scott81321

Nice! What did you change? Running the model which is provided by the author does not work..

lumyus avatar Jan 11 '22 21:01 lumyus

@scott81321 @hmf17 Hi, I wonder how you guys got the program working with training (train_filter = 1), even with the KITTI datasets that Martin originally used? When I read in the datasets, and start training, I got the following error that I have no clue about:

Sequence name : 2011_09_30_drive_0028_sync

Sequence name : 2011_09_30_drive_0033_sync
Dataset is too short (15.94 s)

Sequence name : 2011_09_30_drive_0034_sync
Dataset is too short (12.24 s)

Sequence name : 2011_09_30_drive_0072_sync
Dataset is too short (0.05 s)

Total dataset duration : 825.41 s
IEKF nets NOT loaded
Traceback (most recent call last):
  File "main_kitti.py", line 484, in launch(KITTIArgs)
  File "main_kitti.py", line 28, in launch
    train_filter(args, dataset)
  File "/home/terryl/projects/AI-IMU-DR/ai-imu-dr/src/train_torch_filter.py", line 61, in train_filter
    prepare_loss_data(args, dataset)
  File "/home/terryl/projects/AI-IMU-DR/ai-imu-dr/src/train_torch_filter.py", line 108, in prepare_loss_data
    Rot_gt = torch.zeros(Ns[1], 3, 3)
TypeError: zeros() received an invalid combination of arguments - got (NoneType, int, int), but expected one of:
  • (tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
  • (tuple of ints size, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
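The TypeError only says that Ns[1] was None when torch.zeros was called; it does not say why. A small sketch of a check that could run just before the allocation, to fail with a message that names the offending entry, is given here. `validated_sizes` is a hypothetical helper, and treating any non-integer entry as an error is an assumption of this sketch, not the repository's behaviour.

```python
def validated_sizes(sizes, label="Ns"):
    """Return `sizes` as a list of non-negative ints.

    Raises ValueError naming the first entry that is not a usable size
    (for example None), instead of letting the allocation call fail later.
    """
    checked = []
    for i, n in enumerate(sizes):
        # bool is a subclass of int, so it is excluded explicitly.
        if not isinstance(n, int) or isinstance(n, bool) or n < 0:
            raise ValueError(f"{label}[{i}] = {n!r} is not a valid size")
        checked.append(n)
    return checked


print(validated_sizes([0, 100]))
try:
    validated_sizes([0, None])
except ValueError as exc:
    print(exc)
```

This does not fix the underlying cause; it only makes it clearer which value is missing, so it can be traced back to where Ns is defined.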

Hopefully you guys can give me some advice on how to get past this error and, most importantly, get the training program working first. I intend to tailor the program to my application by training the model with my own datasets, if at all possible.

I'm using PyTorch 1.0.0 with GPU version

Thanks in advance! Terry

Hazeline2018 avatar Mar 11 '22 19:03 Hazeline2018

@kebijuelun @scott81321 @lumyus @hmf17 did you guys make any progress on resolving this problem?

The plots seem pretty good even with randomly initialized parameters.

I've modified the sizes of the layers of the Mesnet which resolved some of the errors but this error continues to persist.

"RuntimeError: mat1 and mat2 shapes cannot be multiplied (47945x64 and 32x2)"

saltrack avatar Mar 24 '22 13:03 saltrack
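In the "mat1 and mat2" message quoted above, 47945x64 is the activation entering a linear layer (64 features per row) and 32x2 is that layer's weight as used in the product (32 inputs, 2 outputs). So a layer was widened to 64 outputs while the next one still expects 32. A rough sketch for checking that consecutive widths agree is below; `first_width_mismatch` is a hypothetical helper, and the layer names and the 6 input channels are illustrative assumptions, chosen only to reproduce the 64-versus-32 conflict.

```python
def first_width_mismatch(layers, in_width):
    """Find the first layer whose expected input width does not match.

    `layers` is a list of (name, in_features, out_features) in forward
    order. Returns (name, expected, received) for the first conflict,
    or None if every layer's input matches the previous layer's output.
    """
    width = in_width
    for name, n_in, n_out in layers:
        if n_in != width:
            return (name, n_in, width)
        width = n_out
    return None


# Illustrative widths only: the second layer was widened to 64 outputs,
# but the final linear layer still expects 32 inputs.
layers = [("conv_a", 6, 32), ("conv_b", 32, 64), ("linear", 32, 2)]
print(first_width_mismatch(layers, in_width=6))  # -> ('linear', 32, 64)
```

Whenever a layer's output width is changed, the input width of the layer that consumes it has to change with it; the same applies to any checkpoint that is then loaded into the modified network.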

Hi, I also ran into the mesnet size mismatch problem. When I deleted iekfnets.p and ran the code without the CNN, the result looked good. How can I run the code with the CNN? And why is the result without the CNN adapter already so good? Thanks a lot :)

nothing371442 avatar Apr 27 '22 10:04 nothing371442

Hi, I also ran into the mesnet size mismatch problem. When I deleted iekfnets.p and ran the code without the CNN, the result looked good. How can I run the code with the CNN? And why is the result without the CNN adapter already so good? Thanks a lot :)

The mismatch problem can be solved by turning on the train option (set it to 1); this generates a new iekfnets.p, which can then be used for the test filter.
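Since retraining writes a new iekfnets.p, it may be worth keeping the downloaded file rather than deleting it. A minimal sketch that moves the old file aside first is shown here; `set_aside` and the ".bak" suffix are hypothetical choices for illustration, not part of the repository.

```python
import os
import shutil


def set_aside(path, suffix=".bak"):
    """Move `path` to `path + suffix` if it exists.

    Returns the backup location, or None when there was nothing to move.
    An existing backup with the same name may be overwritten.
    """
    if not os.path.isfile(path):
        return None
    backup = path + suffix
    shutil.move(path, backup)
    return backup
```

For example, `set_aside("../temp/iekfnets.p")` before training would leave the provided checkpoint available as iekfnets.p.bak for later comparison.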

nothing371442 avatar May 05 '22 14:05 nothing371442

@nothing371442 didn't you get any errors while training as mentioned in #72?

Did you make any changes to getting the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it.

Rajat-Arora avatar Nov 19 '22 02:11 Rajat-Arora

@nothing371442 didn't you get any errors while training as mentioned in #72?

Did you make any changes to getting the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it.

Did you delete the iekfnets.p file first? I deleted the iekfnets.p file first and then ran with the train option set to 1, which generates a new .p file.

nothing371442 avatar Nov 19 '22 03:11 nothing371442

@nothing371442 didn't you get any errors while training as mentioned in #72? Did you make any changes to getting the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it.

Did you delete the iekfnets.p file first? I deleted the iekfnets.p file first and then ran with the train option set to 1, which generates a new .p file.

Yes, I have deleted this file and set the train option (set to 1), but it gives me an error similar to #72.

image image

Rajat-Arora avatar Nov 19 '22 03:11 Rajat-Arora

Hi guys. As far as I can tell, there is a mismatch between the format of the file iekfnets.p and the format of the CNN. Notice that Brossard's default is test mode, not train mode. I saw discrepancies between the values of the noise covariances in his thesis and what he encoded for the OXTS data files of his test data. This suggests to me that he hardwired these numbers to get the best results for his test cases and, pragmatically, more or less gave up on the training aspect. These noise covariances are the initial ones in main_kitti.py and, less importantly, in utils_numpy_filter.py. I had to modify the ones in main_kitti.py to get the best results for the data given to me.

So I would like to ask all of you: what does iekfnets.p contain? Is it only noise covariances? If so, which ones?

scott81321 avatar Nov 20 '22 07:11 scott81321

@nothing371442 didn't you get any errors while training as mentioned in #72? Did you make any changes to getting the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it.

Did you delete the iekfnets.p file first? I deleted the iekfnets.p file first and then ran with the train option set to 1, which generates a new .p file.

Yes, I have deleted this file and set the train option (set to 1), but it gives me an error similar to #72.

image image

Hi, did you download the provided delta_p.p file first?

nothing371442 avatar Nov 21 '22 13:11 nothing371442

Hi guys. As far as I can tell, there is a mismatch between the format of the file iekfnets.p and the format of the CNN. Notice that Brossard's default is test mode, not train mode. I saw discrepancies between the values of the noise covariances in his thesis and what he encoded for the OXTS data files of his test data. This suggests to me that he hardwired these numbers to get the best results for his test cases and, pragmatically, more or less gave up on the training aspect. These noise covariances are the initial ones in main_kitti.py and, less importantly, in utils_numpy_filter.py. I had to modify the ones in main_kitti.py to get the best results for the data given to me.

So I would like to ask all of you: what does iekfnets.p contain? Is it only noise covariances? If so, which ones?

I think it contains the net parameters, as in the picture below. net_para

nothing371442 avatar Nov 21 '22 13:11 nothing371442

@nothing371442 didn't you get any errors while training as mentioned in #72? Did you make any changes to getting the train option (set to 1) working on the existing dataset provided by the author? Could you help me out with it.

Did you delete the iekfnets.p file first? I deleted the iekfnets.p file first and then ran with the train option set to 1, which generates a new .p file.

Yes, I have deleted this file and set the train option (set to 1), but it gives me an error similar to #72. image image

Hi, did you download the provided delta_p.p file first?

I was able to figure it out and train the model; there were some issues with the version of PyTorch that I was using.

Rajat-Arora avatar Nov 22 '22 18:11 Rajat-Arora

Hi guys. As far as I can tell, there is a mismatch between the format of the file iekfnets.p and the format of the CNN. Notice that Brossard's default is test mode, not train mode. I saw discrepancies between the values of the noise covariances in his thesis and what he encoded for the OXTS data files of his test data. This suggests to me that he hardwired these numbers to get the best results for his test cases and, pragmatically, more or less gave up on the training aspect. These noise covariances are the initial ones in main_kitti.py and, less importantly, in utils_numpy_filter.py. I had to modify the ones in main_kitti.py to get the best results for the data given to me.

So I would like to ask all of you: what does iekfnets.p contain? Is it only noise covariances? If so, which ones?

Hi @scott81321, could you please describe in more detail what modifications you made in main_kitti.py to get the best results? Also, you mentioned the data given to you: do you mean the dataset provided by the author or your own dataset?

Rajat-Arora avatar Nov 22 '22 18:11 Rajat-Arora