
Error reloading model from checkpoint

Open lyyc199586 opened this issue 11 months ago • 3 comments

I am trying to reload a saved model as follows.

Saving:

model.save_checkpoint("./model", save_name="fno")

Loading:

model_reload = FNO.from_checkpoint('./model', save_name="fno")

This raises the following error:

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
Cell In[9], line 12
      1 # reload model
      2 # model_reload = FNO(
      3 #     n_modes=(16,16),
   (...)
      9 # model_reload.load_state_dict(torch.load("./model/fno.pt", weights_only=False))
     10 # model_reload.eval()
---> 12 model_reload = FNO.from_checkpoint('./model', save_name="fno")

File C:\workspace\no_playground\neuraloperator\neuralop\models\base_model.py:179, in BaseModel.from_checkpoint(cls, save_folder, save_name, map_location)
    176     init_args = []
    177 instance = cls(*init_args, **init_kwargs)
--> 179 instance.load_checkpoint(save_folder, save_name, map_location=map_location)
    180 return instance

File C:\workspace\no_playground\neuraloperator\neuralop\models\base_model.py:159, in BaseModel.load_checkpoint(self, save_folder, save_name, map_location)
    157 save_folder = Path(save_folder)
    158 state_dict_filepath = save_folder.joinpath(f'{save_name}_state_dict.pt').as_posix()
--> 159 self.load_state_dict(torch.load(state_dict_filepath, map_location=map_location))

File c:\workspace\no_playground\no\lib\site-packages\torch\serialization.py:1470, in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
   1462                 return _load(
   1463                     opened_zipfile,
   1464                     map_location,
   (...)
   1467                     **pickle_load_args,
   1468                 )
   1469             except pickle.UnpicklingError as e:
-> 1470                 raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
   1471         return _load(
   1472             opened_zipfile,
   1473             map_location,
   (...)
   1476             **pickle_load_args,
   1477         )
   1478 if mmap:

UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL torch._C._nn.gelu was not an allowed global by default. Please use `torch.serialization.add_safe_globals([gelu])` or the `torch.serialization.safe_globals([gelu])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
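
The mechanism behind this error can be illustrated with the standard library alone. The following is a minimal sketch, not PyTorch's implementation: a restricted unpickler that rejects any global not on an allowlist, which is conceptually what `weights_only=True` does. The names `AllowlistUnpickler` and `restricted_loads` are hypothetical, and `math.sqrt` stands in for a pickled function reference such as `gelu`.

```python
import io
import math
import pickle


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals listed in `allowed` (hypothetical helper)."""

    def __init__(self, file, allowed=()):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        # Called whenever the pickle stream references a class or function by name.
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")


def restricted_loads(data, allowed=()):
    """Load pickled bytes, permitting only the allowlisted (module, name) pairs."""
    return AllowlistUnpickler(io.BytesIO(data), allowed=allowed).load()


# A checkpoint-like payload: plain data plus a reference to a function.
payload = pickle.dumps({"weights": [0.1, 0.2], "activation": math.sqrt})

# Without an allowlist entry, the function reference is rejected.
try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Allowlisting the global lets the same payload load.
restored = restricted_loads(payload, allowed=[("math", "sqrt")])
```

In this sketch the payload is rejected until `("math", "sqrt")` is allowlisted, which mirrors the two options in the message above: either disable the restriction (`weights_only=False`, only for trusted files) or explicitly allowlist the reported global.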

It seems that PyTorch 2.6 is not compatible. Which version of torch should the neuralop package use?

As a workaround, I can save and load the state dict directly:

# save
torch.save(model.state_dict(), "./model/fno.pt")

# reload
model_reload = FNO(
    n_modes=(16,16),
    in_channels=1,
    out_channels=1,
    hidden_channels=32,
    projection_channel_ratio=2
)
model_reload.load_state_dict(torch.load("./model/fno.pt", weights_only=False))

lyyc199586, Mar 18 '25 20:03

Hi @lyyc199586, thanks for opening this. You're right that this is an issue in torch 2.6, so I've opened #559 to fix it.

dhpitt, Mar 20 '25 21:03

Thanks for the update; saving and loading the model now works! However, I found that resuming training from a checkpoint fails with the same error:

trainer = Trainer(model=model, n_epochs=100,
                  device=device,
                  data_processor=data_processor,
                  wandb_log=False,
                  eval_interval=1,
                  use_distributed=False,
                  verbose=True)

trainer.train(train_loader=mod_train_loader,
              test_loaders=mod_test_loaders,
              optimizer=optimizer,
              scheduler=scheduler,
              regularizer=False,
              training_loss=train_loss,
              eval_losses=eval_losses,
              save_every=10,
            #   save_dir="./ckpt/sparse_mask/",
              resume_from_dir="./ckpt/sparse_mask/"
              )

Error:

-----------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
Cell In[10], line 10
      1 # train
      2 trainer = Trainer(model=model, n_epochs=100,
      3                   device=device,
      4                   data_processor=data_processor,
   (...)
      7                   use_distributed=False,
      8                   verbose=True)
---> 10 trainer.train(train_loader=mod_train_loader,
     11               test_loaders=mod_test_loaders,
     12               optimizer=optimizer,
     13               scheduler=scheduler,
     14               regularizer=False,
     15               training_loss=train_loss,
     16               eval_losses=eval_losses,
     17               save_every=10,
     18             #   save_dir="./ckpt/sparse_mask/",
     19               resume_from_dir="./ckpt/sparse_mask/"
     20               )

File C:\workspace\no_playground\neuraloperator\neuralop\training\trainer.py:177, in Trainer.train(self, train_loader, test_loaders, optimizer, scheduler, regularizer, training_loss, eval_losses, save_every, save_best, save_dir, resume_from_dir)
    175 self.save_best = save_best
    176 if resume_from_dir is not None:
...
	(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL neuralop.layers.embeddings.GridEmbedding2D was not an allowed global by default. Please use `torch.serialization.add_safe_globals([GridEmbedding2D])` or the `torch.serialization.safe_globals([GridEmbedding2D])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
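
The `GridEmbedding2D` message means the saved training state references a class by name, so a restricted loader has to be told that class is safe. As a rough sketch of how to find out which globals a pickle references before deciding what to allowlist, the standard library's `pickletools` can scan the opcode stream without executing it. This uses plain `pickle` bytes rather than a real torch checkpoint, the helper name `referenced_globals` is hypothetical, and `fractions.Fraction` is only a stand-in for a custom class. The sketch handles only the `GLOBAL` opcode emitted by pickle protocol 2 and below.

```python
import pickle
import pickletools
from collections import OrderedDict
from fractions import Fraction


def referenced_globals(data):
    """Return the 'module.name' globals referenced by a protocol <= 2 pickle."""
    found = set()
    for opcode, arg, _pos in pickletools.genops(data):
        # For GLOBAL, pickletools reports the argument as "module name".
        if opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)
            found.add(f"{module}.{name}")
    return found


# A state-like object holding plain data plus an instance of a non-builtin class.
state = OrderedDict(weight=[1.0, 2.0], embedding=Fraction(1, 2))
data = pickle.dumps(state, protocol=2)

for name in sorted(referenced_globals(data)):
    print(name)
```

Anything this kind of scan reports beyond basic containers and tensors is a candidate for the "Unsupported global" error, and would need to be allowlisted or loaded with the restriction disabled for a trusted file.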

lyyc199586, Mar 21 '25 16:03

Opened #563 to address this.

dhpitt, Apr 02 '25 20:04