
Make sure the upcoming change in the default for `weights_only` from False to True is handled correctly

Open · lantiga opened this issue 1 year ago · 5 comments

Bug description

Reference: https://dev-discuss.pytorch.org/t/bc-breaking-change-torch-load-is-being-flipped-to-use-weights-only-true-by-default-in-the-nightlies-after-137602/2573
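
In short, `torch.load` is flipping its default from `weights_only=False` to `weights_only=True`, so every place Lightning loads a checkpoint that pickles more than plain tensors needs an explicit decision. A minimal, self-contained sketch of the flip (the `Dummy` class and file name are illustrative):

```python
import torch

class Dummy:  # stands in for the non-tensor objects Lightning checkpoints pickle
    pass

torch.save({"state_dict": {}, "extra": Dummy()}, "demo.ckpt")

# torch < 2.6: loads fine, because weights_only defaults to False.
# torch >= 2.6: weights_only defaults to True and loading raises an
# UnpicklingError, because Dummy is not on torch's safelist.
try:
    ckpt = torch.load("demo.ckpt")
except Exception as err:
    print(f"Fails under the new default: {err}")

# Opting back into the old behavior (only for checkpoints you trust):
ckpt = torch.load("demo.ckpt", weights_only=False)
```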

What version are you seeing the problem on?

master

How to reproduce the bug

-

Error messages and logs

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4): 2.6+
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

lantiga · Nov 26 '24

Any progress on this? It currently prevents us from resuming training with PyTorch 2.6 and ~~lightning 2.5~~ lightning 2.3.

ORippler · Feb 03 '25

Wanted to check on this issue as well. My existing workflow of `trainer.fit(..., ckpt_path=<CKPT_PATH>)` is also currently blocked because of the default behavior change in torch 2.6.

yair-schiff · Apr 01 '25

> Wanted to check on this issue as well. My existing workflow of `trainer.fit(..., ckpt_path=<CKPT_PATH>)` is also currently blocked because of the default behavior change in torch 2.6.

Curious, but what's your lightning version? Support was added in lightning 2.4; I had erroneously used an old env with lightning 2.3.

ORippler · Apr 07 '25

@ORippler, I see, thanks. I think it's actually 2.2 that I'm using. I'll try upgrading.

yair-schiff · Apr 07 '25

I am running into this exact issue when using a model parallel strategy and resuming from a distributed checkpoint with the latest PyTorch Lightning and PyTorch 2.6. The bug occurs at line 449 of `lightning/fabric/strategies/model_parallel.py` and is still present on main.
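
For context, a hedged sketch of the pattern at play; the function name is illustrative, not the actual code at that line:

```python
import torch

# A strategy-level restore that calls torch.load without weights_only picks
# up the new default on torch >= 2.6 and raises while unpickling any
# non-tensor state in the checkpoint:
def load_full_checkpoint(path):
    return torch.load(path, map_location="cpu")  # breaks on torch >= 2.6

# The straightforward fix is to pass weights_only explicitly, keeping the
# pre-2.6 behavior for checkpoints Lightning itself wrote:
def load_full_checkpoint_fixed(path):
    return torch.load(path, map_location="cpu", weights_only=False)
```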

Should I open up a PR, or is it already part of some other PR?

KyleMylonakisProtopia · Apr 29 '25

An optimal solution here might be to expose `weights_only` in `LightningModule.load_from_checkpoint`. Right now we're on torch 2.4 and I see the warning, and I believe that once we're on torch 2.6 this will break. I'm having a hard time telling just from reading the code whether passing `weights_only=False` as a kwarg to `load_from_checkpoint` will properly make its way to the underlying `torch.load` call.
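
To make both halves of that concrete, a sketch assuming torch >= 2.4 (where `torch.serialization.add_safe_globals` was added); `DemoHParams` and the commented-out kwarg forwarding are assumptions, not current Lightning behavior:

```python
import torch

# Hypothetical API the comment proposes; whether the kwarg currently
# reaches torch.load is exactly the open question:
# model = MyLightningModule.load_from_checkpoint("demo.ckpt", weights_only=False)

# Workaround available in torch itself: safelist the classes the checkpoint
# pickles, then load under the safe default. DemoHParams stands in for
# whatever custom class your checkpoint actually contains.
class DemoHParams:
    pass

torch.save({"hparams": DemoHParams()}, "demo.ckpt")
torch.serialization.add_safe_globals([DemoHParams])
ckpt = torch.load("demo.ckpt", weights_only=True)  # succeeds once safelisted
```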

daturkel · Aug 07 '25