FutureWarning: You are using `torch.load` with `weights_only=False`
Describe the bug
```
FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(filename, lambda storage, loc: storage)
```
This warning is triggered by every `torch.load` call in stanza. The issue does not cause any problems with data processing at the moment, but the long warnings are distracting.
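To illustrate why the default is changing: the `weights_only` mode is essentially an allowlist-based unpickler. A minimal plain-Python sketch of that idea (this is not stanza or torch code, just the underlying mechanism):

```python
import io
import pickle

# Only explicitly allowed globals may be resolved during unpickling,
# so a malicious payload cannot call arbitrary functions such as os.system.
# This mirrors the idea behind torch.load(weights_only=True).
ALLOWED = {("builtins", "set")}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowlisted")

def safe_loads(data: bytes):
    return AllowlistUnpickler(io.BytesIO(data)).load()

# A set round-trips because builtins.set is allowlisted...
assert safe_loads(pickle.dumps({"a", "b"})) == {"a", "b"}

# ...but anything referencing a non-allowlisted global is rejected.
import collections
try:
    safe_loads(pickle.dumps(collections.OrderedDict()))
except pickle.UnpicklingError:
    pass
```

This is why checkpoints containing plain tensors load fine under `weights_only=True`, while checkpoints carrying arbitrary Python objects need explicit allowlisting.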
To Reproduce
Steps to reproduce the behavior:
- upgrade torch to 2.4.1

Expected behavior
No error.
Environment (please complete the following information):
- OS: Windows
- Python version: python 3.12.7
- Stanza version: 1.9.2
The error can be suppressed by using the following before calling stanza functions, but this is not a solution:

```python
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
```
source: https://github.com/ultralytics/ultralytics/issues/14994#issuecomment-2364356239
Aware of it. There's a limitation where we are saving plenty of things other than weights in the current file. Config strings and numbers, mostly. Would those still work?
Some of the models can be updated to use weights_only=True right away, but others require resaving with enums or other data structures removed. Will have to investigate some more.
Sorry for not getting back earlier. I'm using the built-in models like so:

```python
STANZA_PIPE = stanza.Pipeline(
    lang="en",
    dir=settings.STANZA_DATA_DIR,
    processors="tokenize,mwt,pos",
    download_method=None,
    use_gpu=False,
)
```
The affected call sites in the pipeline are:
- `tokenization\trainer.py:82`
- `mwt\trainer.py:201`
- `pos\trainer.py:139`
- `common\pretrain.py:56`
- `common\char_model.py:271`
Thank you for the commit!
Please be aware that on PyTorch 2.6 this warning will become an error. This was reported to PyTorch as:
- https://github.com/pytorch/pytorch/issues/142123
I posted more details in https://github.com/pytorch/pytorch/issues/142123#issuecomment-2524667964, but in short: the default of `weights_only` has been flipped from `False` to `True` on the PyTorch side for the upcoming PyTorch 2.6 (https://github.com/huggingface/transformers/pull/34632 is the corresponding fix on the Transformers side).
You could consider adding an explicit list of allowed safe globals, following the approach taken in Huggingface Transformers and Accelerate. For reference, see:
- https://github.com/huggingface/transformers/pull/34632
I am finishing up some model training and will be able to make a new release with the updated models soon.
@AngledLuffa: note that at the moment the failure reported in https://github.com/pytorch/pytorch/issues/142123 is not fixed in the latest stanza from the main branch (I tried https://github.com/stanfordnlp/stanza/commit/539760cdc5a903de23895db46e5d7c2e1f8f251b - see the log below). The repro is:
```python
import stanza

pos_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos', use_gpu=True, device='xpu')
sentence = "Some sentence"
pos_pipeline(sentence)
```
The https://github.com/stanfordnlp/stanza/pull/1430 previously merged into stanza is not enough to handle this case. The failure happens on this `torch.load()`:
https://github.com/stanfordnlp/stanza/blob/539760cdc5a903de23895db46e5d7c2e1f8f251b/stanza/models/common/pretrain.py#L56
Full log:
```
2024-12-06 16:29:16 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json: 392kB [00:00, 70.6MB/s]
2024-12-06 16:29:17 INFO: Downloaded file to /home/dvrogozh/stanza_resources/resources.json
2024-12-06 16:29:17 WARNING: Language en package default expects mwt, which has been added
2024-12-06 16:29:17 INFO: Loading these models for language: en (English):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| mwt       | combined        |
| pos       | combined_charlm |
===============================
2024-12-06 16:29:17 INFO: Using device: xpu
2024-12-06 16:29:17 INFO: Loading: tokenize
2024-12-06 16:29:18 INFO: Loading: mwt
2024-12-06 16:29:18 INFO: Loading: pos
/home/dvrogozh/git/pytorch/pytorch/torch/_weights_only_unpickler.py:515: UserWarning: Detected pickle protocol 3 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
  warnings.warn(
Traceback (most recent call last):
  File "/home/dvrogozh/tmp/st.py", line 3, in <module>
    pos_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos', use_gpu=True, device='xpu')
  File "/home/dvrogozh/git/stanza/stanza/pipeline/core.py", line 308, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "/home/dvrogozh/git/stanza/stanza/pipeline/processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "/home/dvrogozh/git/stanza/stanza/pipeline/pos_processor.py", line 32, in _set_up_model
    self._trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], device=device, args=args, foundation_cache=pipeline.foundation_cache)
  File "/home/dvrogozh/git/stanza/stanza/models/pos/trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args=args, foundation_cache=foundation_cache)
  File "/home/dvrogozh/git/stanza/stanza/models/pos/trainer.py", line 174, in load
    emb_matrix = pretrain.emb
  File "/home/dvrogozh/git/stanza/stanza/models/common/pretrain.py", line 50, in emb
    self.load()
  File "/home/dvrogozh/git/stanza/stanza/models/common/pretrain.py", line 56, in load
    data = torch.load(self.filename, lambda storage, loc: storage)
  File "/home/dvrogozh/git/pytorch/pytorch/torch/serialization.py", line 1480, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
(pytorch.xpu) dvrogozh@willow-spr03:~/tmp$ cat st.py
import stanza
pos_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos', use_gpu=True, device='xpu')
sentence = "Some sentence"
pos_pipeline(sentence)
```
Got it, but that's the main branch. The updates merged in are in the dev branch, which at that line has `torch.load(..., weights_only=True)`:
https://github.com/stanfordnlp/stanza/blob/5754ec0488636e90cdab26f43d44583d4efc99f0/stanza/models/common/pretrain.py#L60
Ah, sorry. I missed that.
This should now be pushed in v1.10.0
> This should now be pushed in v1.10.0

This problem still exists in 1.10.1 if I use zh-hans 1.9.0. And if I use zh-hans 1.10.0, this error arises:
```
ValueError: md5 for D:\Anaconda\envs\stanza\Lib\site-packages\stanza\stanza_resources\zh-hans\tokenize\gsdsimp.pt is 48f993223d568afedc2893f7cd76719c, expected 68fb709f2a556b132b4915f2b3893ce7
```
It sounds like you have the old models on your system and aren't downloading them. Which is weird, since the Pipeline should automatically download the models for the new version. Are you creating the Pipeline in a way that stops it from downloading?
Thanks for your reply! This is my whole script:
```python
import stanza
import os

nlp = stanza.Pipeline(lang='zh-hans')

input_path = r"cleanned"
output_path = r"parsed"
text_name = [file.split('.')[0] for file in os.listdir(input_path) if file.endswith('.txt')]
text_list = [os.path.join(input_path, file) for file in os.listdir(input_path) if file.endswith('.txt')]
texts = [open(i, "r", encoding='utf-8') for i in text_list]

for i in range(len(texts)):
    doc = nlp(texts[i].read())
    # Print the output in CoNLL-U format
    with open(output_path + "/" + text_name[i] + ".conllu", "w", encoding='utf-8') as f:
        for sentence in doc.sentences:
            for word in sentence.words:
                f.write(f"{word.id}\t{word.text}\t{word.lemma}\t{word.upos}\t{word.xpos}\t{word.feats}\t{word.head}\t{word.deprel}\t{word.deps}\t{word.misc}\n")
            f.write("\n")
```
I have tried downloading 1.10.1, 1.9.0, and 1.8.0 manually, but this bug is still raised. And if I use old models, it shows:
```
2025-01-27 23:27:30 INFO: Using device: cpu
2025-01-27 23:27:30 INFO: Loading: tokenize
2025-01-27 23:27:30 ERROR: Cannot load model from D:\Anaconda\envs\stanza\Lib\site-packages\stanza\stanza_resources\zh-hans\tokenize\gsdsimp.pt
Traceback (most recent call last):
  File "d:\papers_2\cooperation\华语树库\stanza_parse.py", line 7, in torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL builtins.set was not an allowed global by default. Please use torch.serialization.add_safe_globals([set]) to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Highly recommend using backticks (`) to format code.

When the script is running, does it say something like

```
Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
```

It should. The md5sum for the tokenizer model for version 1.10 is `48f993223d568afedc2893f7cd76719c`, so it looks like you downloaded the right model but somehow didn't download the resources. Are you able to download this file?
https://github.com/stanfordnlp/stanza-resources/blob/main/resources_1.10.0.json
The block for the Chinese models is here:
https://github.com/stanfordnlp/stanza-resources/blob/f06522caadca99c72200e20ee158fe5e63b75e97/resources_1.10.0.json#L12350
Your local version of the resources file should look like that.
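If you want to check which model file you actually have on disk, comparing md5sums locally is straightforward (a stdlib sketch; the path and hash referenced in the comment are the ones from the report above):

```python
import hashlib

def file_md5(path: str) -> str:
    # Hash the file in chunks so large model files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the md5 listed for the model in resources_1.10.0.json, e.g.
# file_md5(r"D:\Anaconda\envs\stanza\Lib\site-packages\stanza\stanza_resources\zh-hans\tokenize\gsdsimp.pt")
# should be "48f993223d568afedc2893f7cd76719c" for the 1.10 tokenizer model.
```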
Now it is working! It must be that I used the wrong resources.json. Thank you very much!