transformers

Auto-download is a security hole.

Open freckletonj opened this issue 2 years ago • 2 comments

System Info

I just ran a project and it decided to download a completely unrelated dataset, which I didn't want or need. The extraneous download was https://huggingface.co/datasets/allenai/c4, which upon inspection contains 800+ trojan viruses. Are these false positives? I shouldn't have to care unless I'm interested in this specific dataset.

I think any network calls should be strictly opt-in, e.g. perhaps HF_NETWORK_ALLOWED=True python whatever.py

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Steps to reproduce:

  1. Run any HF model for the first time. It will make network calls, and download datasets and weights.

Expected behavior

No network calls are made unless the user opts in.

freckletonj, Apr 23 '23 01:04

Hi @freckletonj, thanks for raising this issue.

Without knowing which code you're running, it's hard to know what specifically triggered the dataset download (or how unrelated it is). Typically, a dataset is downloaded only if it's requested through the load_dataset functionality. However, I see that the allenai/c4 dataset needs to be downloaded through git clone. In general, if you've spotted malicious content within a dataset, I'd recommend flagging it on the dataset's repo (there's already an open discussion here).

You can run transformers in a firewalled or offline environment by setting TRANSFORMERS_OFFLINE=1 in your environment. For datasets, the equivalent is HF_DATASETS_OFFLINE=1. See: https://huggingface.co/docs/transformers/v4.28.1/en/installation#offline-mode.
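
As a rough sketch of the offline settings above, the two variables can also be set from Python before the libraries are imported. The helper name enable_offline_mode and the path in the commented-out lines are hypothetical, chosen only for illustration. As far as I know, both libraries read these variables when they are first imported, so the flags are set first here.

```python
import os

# The two flags named in the reply above. Set them before importing
# transformers or datasets (assumption: the values are read at import time).
OFFLINE_FLAGS = {
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
}


def enable_offline_mode(env=os.environ):
    """Hypothetical helper: write the offline flags into `env` and return it."""
    for name, value in OFFLINE_FLAGS.items():
        env[name] = value
    return env


enable_offline_mode()

# Import only after the flags are set. Left commented out so the sketch runs
# without the libraries installed; the model path is a placeholder.
# from transformers import AutoModel
# model = AutoModel.from_pretrained("path/to/local/model", local_files_only=True)
```

Setting the variables in the shell (for example TRANSFORMERS_OFFLINE=1 HF_DATASETS_OFFLINE=1 python whatever.py) should have the same effect and avoids any import-order concerns.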

amyeroberts, Apr 24 '23 11:04

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], May 23 '23 15:05