transformers
Auto-download is a security hole.
System Info
I just ran a project and it decided to download a completely unrelated dataset, which I didn't want or need. The extraneous download was https://huggingface.co/datasets/allenai/c4, which upon inspection contains 800+ trojan viruses. Are these false positives? I shouldn't have to care unless I'm interested in this specific dataset.
I think any network calls should be strictly opt-in, e.g. perhaps HF_NETWORK_ALLOWED=True python whatever.py
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Run any HF model for the first time. It will make network calls, and download datasets and weights.
Expected behavior
Zero network calls are made unless the user opts in.
Hi @freckletonj, thanks for raising this issue.
Without knowing which code you're running, it's hard to know what specifically triggered the dataset download (or how unrelated it is). Typically, a dataset would be downloaded if it's requested through the load_dataset functionality. However, I see that the allenai/c4 dataset needs to be downloaded through git clone. In general, if you've spotted malicious content within a dataset, I'd recommend flagging it on the repo (there's already an open discussion here).
You can run transformers in a firewalled or offline setting by setting TRANSFORMERS_OFFLINE=1 in your environment. For datasets, the equivalent is HF_DATASETS_OFFLINE=1. See: https://huggingface.co/docs/transformers/v4.28.1/en/installation#offline-mode.
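If exporting the variables in the shell isn't convenient, the same two flags can be set from Python. This is a minimal sketch, not an official recipe: `enable_offline_mode` is a hypothetical helper name, and it assumes the flags are set before `transformers` or `datasets` is imported, since the libraries may read them at import time.

```python
import os

# The two flags named above. Setting them before the libraries are imported
# is the safe order, since they may be read once at import time.
OFFLINE_FLAGS = {
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
}


def enable_offline_mode(env=os.environ):
    """Hypothetical helper: write the offline flags into an environment mapping."""
    for name, value in OFFLINE_FLAGS.items():
        env[name] = value
    return env


enable_offline_mode()

# Import the libraries only after the flags are in place, for example:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("path/to/local/model")  # hypothetical local path
```

With the flags set, loading should rely on files already in the local cache or at a local path, rather than fetching them from the Hub.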
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.