
FutureWarning: You are using `torch.load` with `weights_only=False`

Open mskaif opened this issue 1 year ago • 1 comment

Describe the bug

FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (see https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

state = torch.load(filename, lambda storage, loc: storage)

This warning is triggered by all torch.load used in stanza. The issue does not cause any problem with data processing at the moment but the long warnings are distracting.

To Reproduce

Steps to reproduce the behavior:

  1. upgrade torch to 2.4.1

Expected behavior: no warning

Environment (please complete the following information):

  • OS: Windows
  • Python version: python 3.12.7
  • Stanza version: 1.9.2

mskaif avatar Oct 23 '24 05:10 mskaif

The warning can be suppressed by running the following before calling stanza functions, but this is not a solution:

```python
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
```

source: https://github.com/ultralytics/ultralytics/issues/14994#issuecomment-2364356239
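If suppression is the stopgap anyway, a narrower variant is to scope the filter with a context manager so that only model loading is affected. This is just a suggestion; the commented-out `stanza.Pipeline` call marks where the loading would happen:

```python
import warnings

# Ignore FutureWarning only inside this block; the previous filters are
# restored on exit, so FutureWarnings elsewhere in the program still surface.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    # nlp = stanza.Pipeline(lang="en")  # model loading would happen here
```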

mskaif avatar Oct 23 '24 05:10 mskaif

Aware of it. There's a limitation where we are saving plenty of things other than weights in the current file. Config strings and numbers, mostly. Would those still work?
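For reference, plain config values do survive: the restricted unpickler behind `weights_only=True` accepts builtin containers, strings, numbers, and tensors by default. A self-contained sketch (not stanza's actual checkpoint layout):

```python
import os
import tempfile

import torch

# A checkpoint mixing config strings/numbers with weights still loads under
# weights_only=True, since dict/str/int and tensors are on the default allowlist.
ckpt = {"config": {"lang": "en", "hidden_dim": 256}, "weights": torch.zeros(2, 3)}
path = os.path.join(tempfile.gettempdir(), "demo_ckpt.pt")
torch.save(ckpt, path)
loaded = torch.load(path, weights_only=True)
```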


AngledLuffa avatar Oct 23 '24 06:10 AngledLuffa

Some of the models can be updated to use weights_only=True right away, but others require resaving with enums or other data structures removed. Will have to investigate some more.
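The resaving step could look roughly like this. `Mode` is a made-up enum standing in for whatever non-primitive objects a given stanza checkpoint stores; the point is only that flattening such objects to primitives makes the file loadable with `weights_only=True`:

```python
import enum
import os
import tempfile

import torch

class Mode(enum.Enum):  # hypothetical stand-in for an enum saved in a checkpoint
    TRAIN = "train"

# Flatten the enum to a plain string before resaving, so the new file loads
# under weights_only=True with no allowlisting needed.
old = {"mode": Mode.TRAIN, "w": torch.ones(2)}
new = {"mode": old["mode"].value, "w": old["w"]}
path = os.path.join(tempfile.gettempdir(), "resaved.pt")
torch.save(new, path)
reloaded = torch.load(path, weights_only=True)
```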

AngledLuffa avatar Oct 24 '24 06:10 AngledLuffa

> Some of the models can be updated to use weights_only=True right away, but others require resaving with enums or other data structures removed. Will have to investigate some more.

Sorry for not getting back earlier. I'm using the built-in models like so:

```python
STANZA_PIPE = stanza.Pipeline(
    lang="en",
    dir=settings.STANZA_DATA_DIR,
    processors="tokenize,mwt,pos",
    download_method=None,
    use_gpu=False,
)
```

The calls affected in the pipeline are:

  • tokenization\trainer.py:82
  • mwt\trainer.py:201
  • pos\trainer.py:139
  • common\pretrain.py:56
  • common\char_model.py:271

Thank you for the commit!

mskaif avatar Oct 25 '24 05:10 mskaif

Please be aware that on pytorch 2.6 this warning will become an error. This was reported to pytorch as:

  • https://github.com/pytorch/pytorch/issues/142123

I posted more details in https://github.com/pytorch/pytorch/issues/142123#issuecomment-2524667964, but in short: a PR on the pytorch side has flipped the default of weights_only from False to True in the upcoming pytorch 2.6.

You could consider adding an explicit list of allowed safe globals, following the approach taken in Huggingface Transformers and Accelerate. For reference, see:

  • https://github.com/huggingface/transformers/pull/34632

dvrogozh avatar Dec 07 '24 00:12 dvrogozh

I am finishing up some model training and will be able to make a new release with the updated models soon.

AngledLuffa avatar Dec 07 '24 00:12 AngledLuffa

@AngledLuffa: note that at the moment the failure reported in https://github.com/pytorch/pytorch/issues/142123 is not fixed in the latest stanza from the main branch (I tried https://github.com/stanfordnlp/stanza/commit/539760cdc5a903de23895db46e5d7c2e1f8f251b; see the log below). The repro is:

```python
import stanza

pos_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos', use_gpu=True, device='xpu')
sentence = "Some sentence"
pos_pipeline(sentence)
```

PR https://github.com/stanfordnlp/stanza/pull/1430, previously merged into stanza, is not enough to handle this case. The failure happens on this torch.load() call: https://github.com/stanfordnlp/stanza/blob/539760cdc5a903de23895db46e5d7c2e1f8f251b/stanza/models/common/pretrain.py#L56

Full log:

2024-12-06 16:29:16 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json: 392kB [00:00, 70.6MB/s]
2024-12-06 16:29:17 INFO: Downloaded file to /home/dvrogozh/stanza_resources/resources.json
2024-12-06 16:29:17 WARNING: Language en package default expects mwt, which has been added
2024-12-06 16:29:17 INFO: Loading these models for language: en (English):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| mwt       | combined        |
| pos       | combined_charlm |
===============================

2024-12-06 16:29:17 INFO: Using device: xpu
2024-12-06 16:29:17 INFO: Loading: tokenize
2024-12-06 16:29:18 INFO: Loading: mwt
2024-12-06 16:29:18 INFO: Loading: pos
/home/dvrogozh/git/pytorch/pytorch/torch/_weights_only_unpickler.py:515: UserWarning: Detected pickle protocol 3 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
  warnings.warn(
Traceback (most recent call last):
  File "/home/dvrogozh/tmp/st.py", line 3, in <module>
    pos_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos', use_gpu=True, device='xpu')
  File "/home/dvrogozh/git/stanza/stanza/pipeline/core.py", line 308, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "/home/dvrogozh/git/stanza/stanza/pipeline/processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "/home/dvrogozh/git/stanza/stanza/pipeline/pos_processor.py", line 32, in _set_up_model
    self._trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], device=device, args=args, foundation_cache=pipeline.foundation_cache)
  File "/home/dvrogozh/git/stanza/stanza/models/pos/trainer.py", line 34, in __init__
    self.load(model_file, pretrain, args=args, foundation_cache=foundation_cache)
  File "/home/dvrogozh/git/stanza/stanza/models/pos/trainer.py", line 174, in load
    emb_matrix = pretrain.emb
  File "/home/dvrogozh/git/stanza/stanza/models/common/pretrain.py", line 50, in emb
    self.load()
  File "/home/dvrogozh/git/stanza/stanza/models/common/pretrain.py", line 56, in load
    data = torch.load(self.filename, lambda storage, loc: storage)
  File "/home/dvrogozh/git/pytorch/pytorch/torch/serialization.py", line 1480, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

dvrogozh avatar Dec 07 '24 00:12 dvrogozh

Got it, but that's the main branch. The updates merged in are in the dev branch, which at that line has torch.load(... weights_only=True)

https://github.com/stanfordnlp/stanza/blob/5754ec0488636e90cdab26f43d44583d4efc99f0/stanza/models/common/pretrain.py#L60

AngledLuffa avatar Dec 07 '24 00:12 AngledLuffa

> Got it, but that's the main branch. The updates merged in are in the dev branch, which at that line has torch.load(... weights_only=True)

Ah, sorry. I missed that.

dvrogozh avatar Dec 07 '24 00:12 dvrogozh

This should now be pushed in v1.10.0

AngledLuffa avatar Dec 23 '24 07:12 AngledLuffa

> This should now be pushed in v1.10.0

This problem still exists in 1.10.1 if I use the zh-hans 1.9.0 models. And if I use zh-hans 1.10.0, this error is raised:

ValueError: md5 for D:\Anaconda\envs\stanza\Lib\site-packages\stanza\stanza_resources\zh-hans\tokenize\gsdsimp.pt is 48f993223d568afedc2893f7cd76719c, expected 68fb709f2a556b132b4915f2b3893ce7

YuhuYang avatar Jan 27 '25 15:01 YuhuYang

It sounds like you have the old models on your system and aren't downloading them. Which is weird, since the Pipeline should automatically download the models for the new version. Are you creating the Pipeline in a way that stops it from downloading?

AngledLuffa avatar Jan 27 '25 15:01 AngledLuffa

Thanks for your reply! This is the whole of my script:

```python
import stanza
import os

nlp = stanza.Pipeline(lang='zh-hans')

input_path = r"cleanned"
output_path = r"parsed"
text_name = [file.split('.')[0] for file in os.listdir(input_path) if file.endswith('.txt')]
text_list = [os.path.join(input_path, file) for file in os.listdir(input_path) if file.endswith('.txt')]
texts = [open(i, "r", encoding='utf-8') for i in text_list]

for i in range(len(texts)):
    doc = nlp(texts[i].read())

    # print the output in CoNLL-U format
    with open(output_path + "/" + text_name[i] + ".conllu", "w", encoding='utf-8') as f:
        for sentence in doc.sentences:
            for word in sentence.words:
                f.write(f"{word.id}\t{word.text}\t{word.lemma}\t{word.upos}\t{word.xpos}\t{word.feats}\t{word.head}\t{word.deprel}\t{word.deps}\t{word.misc}\n")
            f.write("\n")
```

I have tried downloading 1.10.1, 1.9.0, and 1.8.0 manually, but this bug is still raised. If I use the old models, it shows:

```
2025-01-27 23:27:30 INFO: Using device: cpu
2025-01-27 23:27:30 INFO: Loading: tokenize
2025-01-27 23:27:30 ERROR: Cannot load model from D:\Anaconda\envs\stanza\Lib\site-packages\stanza\stanza_resources\zh-hans\tokenize\gsdsimp.pt
Traceback (most recent call last):
  File "d:\papers_2\cooperation\华语树库\stanza_parse.py", line 7, in <module>
    nlp = stanza.Pipeline(lang='zh-hans')
  File "D:\Anaconda\envs\stanza\lib\site-packages\stanza\pipeline\core.py", line 308, in __init__
    self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
  File "D:\Anaconda\envs\stanza\lib\site-packages\stanza\pipeline\processor.py", line 193, in __init__
    self._set_up_model(config, pipeline, device)
  File "D:\Anaconda\envs\stanza\lib\site-packages\stanza\pipeline\tokenize_processor.py", line 44, in _set_up_model
    self._trainer = Trainer(model_file=config['model_path'], device=device)
  File "D:\Anaconda\envs\stanza\lib\site-packages\stanza\models\tokenization\trainer.py", line 20, in __init__
    self.load(model_file)
  File "D:\Anaconda\envs\stanza\lib\site-packages\stanza\models\tokenization\trainer.py", line 84, in load
    checkpoint = torch.load(filename, lambda storage, loc: storage, weights_only=True)
  File "D:\Anaconda\envs\stanza\lib\site-packages\torch\serialization.py", line 1383, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL builtins.set was not an allowed global by default. Please use torch.serialization.add_safe_globals([set]) to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```


YuhuYang avatar Jan 27 '25 15:01 YuhuYang

highly recommend using backticks ` to format code

when the script is running, does it say something like

Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES

it should. the md5sum for the tokenizer model for version 1.10 is 48f993223d568afedc2893f7cd76719c, so it looks like you downloaded the right model but somehow didn't download the resources. are you able to download this file?

https://github.com/stanfordnlp/stanza-resources/blob/main/resources_1.10.0.json

the block for the Chinese models is here:

https://github.com/stanfordnlp/stanza-resources/blob/f06522caadca99c72200e20ee158fe5e63b75e97/resources_1.10.0.json#L12350

your local version of the resources file should look like that

AngledLuffa avatar Jan 27 '25 15:01 AngledLuffa

Now it is working! It must be that I used the wrong resources.json. Thank you very much!


YuhuYang avatar Jan 27 '25 16:01 YuhuYang