lm-evaluation-harness
Security features from the Hugging Face datasets library
In the next version of datasets we are adding a new argument trust_remote_code to let users decide whether or not to trust the dataset scripts written in Python on Hugging Face. The default in the next release will still be trust_remote_code=True, but in the coming months we plan to do a major release and set it to False by default.
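For context, here's a minimal sketch of what this looks like from the caller's side (the dataset name here is just an illustration):

```python
from datasets import load_dataset

# Explicitly opt in to running the dataset's Python loading script.
# Once the default flips to False, script-based datasets will require
# this argument to be passed explicitly.
ds = load_dataset("super_glue", "boolq", trust_remote_code=True)
```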
Most datasets in lm-evaluation-harness are defined on HF using dataset scripts, and may require passing trust_remote_code=True.
Those datasets could also be converted to e.g. JSONL or Parquet datasets on HF (e.g. like Muennighoff/babi).
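If it helps, a rough sketch (repo names are placeholders) of converting a script-based dataset into a script-free copy with push_to_hub, which uploads the data as Parquet:

```python
from datasets import load_dataset

# Load the script-based dataset once, trusting its loading script...
ds = load_dataset("original-org/script-based-dataset", trust_remote_code=True)

# ...then push the materialized splits to a new repo. push_to_hub uploads
# Parquet files, so the copy can be loaded without any remote code.
ds.push_to_hub("my-org/script-free-copy")
```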
Let me know if you have questions on this change and if I can help make a smooth transition.
Ideally this should make lm-evaluation-harness safer for people using datasets defined with Python code.
cc @haileyschoelkopf
Thanks, @lhoestq ! Will see about the smoothest way to make this non-invasive for users while still having them acknowledge that they may be running untrusted code.
I'm happy to help on this issue if it's something you'd still like, so that lm-evaluation-harness is able to bump datasets. I see we include it as a parameter when initializing the constructor for HFLM.
The steps I had in mind:
- Set the flag to True
- If running a Hugging Face model, include a Y/n prompt at the CLI run level after the user triggers a model run (roughly sketched below).
We mentioned this may be invasive for users and introduce an extra level of friction, since they don't usually need to respond to CLI prompts to continue when using lm-evaluation-harness - perhaps a log line entry is enough here?
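For concreteness, a rough sketch of what that Y/n prompt could look like (purely hypothetical, not existing lm-evaluation-harness behavior):

```python
# Hypothetical confirmation prompt before loading a script-based HF dataset.
answer = input(
    "This task's dataset runs a Python loading script from the Hugging Face Hub. "
    "Trust remote code and continue? [Y/n] "
)
if answer.strip().lower() not in ("", "y", "yes"):
    raise SystemExit("Aborting: remote dataset code was not trusted.")
```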
Let me know what you think @haileyschoelkopf
Hi @veekaybee, this'd be awesome! I agree that since the current behavior is to run datasets from HF datasets, remote code or no, manually setting it to True so we can update datasets wouldn't be a regression in safety, but it'd be nice if there's a way to one-time opt in per env install to trust_remote_code or something, and subsequently override to trust_remote_code=True as you're suggesting.
Maybe @lhoestq has suggestions for how to non-invasively incorporate this while not making it too much friction for users?
I'm taking a look at how some other libraries handle this:
- For OpenLLM, it's passed as an environment variable on invocation and here
- vLLM passes it as a dataclass of config args
Perhaps we could do some variation of the first one, where we indicate in the docs that users should set it and then check the environment variable when a simple evaluation for HFLM is instantiated?
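As a sketch of that variation (the helper name is made up, and exactly where it gets called from is open):

```python
import os

def _env_trust_remote_code(default: bool = False) -> bool:
    """Sketch: read HF_DATASETS_TRUST_REMOTE_CODE from the user's environment."""
    value = os.environ.get("HF_DATASETS_TRUST_REMOTE_CODE")
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

# Hypothetical use inside HFLM.__init__:
# self.trust_remote_code = trust_remote_code if trust_remote_code is not None else _env_trust_remote_code()
```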
Hmmm... I think for models, allowing it to be passed through --model_args as we do now for vllm and HFLM doesn't seem like a problem to users (though if there's something better that can switch its default to True, like an env var, I'm open to it for sure!) - but it seems nonideal to have a --dataset_args or --trust_remote_code standalone switch for Datasets' trust_remote_code kwarg that people will possibly need to always have on for certain tasks.
It would be great to make the list of tasks that require trust_remote_code=True, maybe using code like this?
from datasets import get_dataset_config_names

task_that_require_trust_remote_code = []
for task in tasks:
    try:
        # Probe the dataset without trusting remote code; datasets raises a
        # ValueError mentioning trust_remote_code=True if a script is required.
        get_dataset_config_names(task.DATASET_PATH, trust_remote_code=False)
    except ValueError as e:
        if "trust_remote_code=True" in str(e):
            task_that_require_trust_remote_code.append(task)
Maybe for all the tasks that we can already trust we can add trust_remote_code: true in their YAML (e.g. the ones maintained by EAI or HF), and require --trust_remote_code for the rest?
Note that there is also an environment variable HF_DATASETS_TRUST_REMOTE_CODE that you can set to "1" to always trust remote code by default, if it can help. It can also be dynamically set using datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
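Concretely, that's something like:

```python
import os

# Set the environment variable before `datasets` is imported so it is
# picked up when the library reads its config...
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "1"

import datasets

# ...or flip the config value directly at runtime.
datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
```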
I want to summarize so far for my own understanding, let me know if I'm missing or mischaracterizing anything:
Datasets is HuggingFace's library for loading datasets for model training/tuning. Some of these datasets require running extra scripts to fully populate the data.
Datasets
The current behavior of load_dataset, which loads files plus any scripts needed to populate them, is that trust_remote_code is initialized in the object constructor and currently set to True.
There is a change in the new version that allows users to pass this explicitly when calling load_dataset; it still defaults to True, as before, but in an upcoming major release the default will be set to False.
LM-Harness
Within datasets used for evaluation, such as squadv2, we call the datasets load_dataset function, example here.
This means that when we run evaluation, regardless of what model type we are evaluating, the task is run and the script for the task is run. For example:
lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks squadv2 \
--batch_size 16
will trigger this issue or warning (if squadv2 is a dataset with scripts that need to be executed).
In order to understand when this would be a user issue, we'd like to understand which datasets actually require trust_remote_code=True, which we can get from dataset metadata, and based on that, we could potentially pass it here, like this:
lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B,trust_remote_code=True \
--tasks squadv2 \
--batch_size 16
which would mean it gets picked up from the configuration file.
@veekaybee This is correct!
@lhoestq thank you, that environment variable is quite helpful--exactly what I was hoping might exist!
In this case, then, would it make sense to do the following:
In the config:
- Add a new field, trust_remote_code, to the task YAML files (based on the inventory suggested above) and to the TaskConfig dict
Within the Model constructor:
- Set the parameter in the constructor to None
- Set trust_remote_code here, to be something like:
  datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
  self.trust_remote_code = trust_remote_code if trust_remote_code is not None else os.environ.get('HF_DATASETS_TRUST_REMOTE_CODE')
- Add a log warning, "you are trusting remote code for this task/model run", if the code in step 2 gets called
A couple of things I'm not clear on: should we set the environment variable from the TaskConfig directly, or once we instantiate the class? And also, how does the TaskConfig dict get passed to the HFLM class - can we call it directly?
So to clarify, both transformers and datasets now have trust_remote_code=True, but that doesn't necessarily mean we have to link the behaviors. We currently support trust_remote_code (and default to False) for HF transformers models, but don't support specifying this option for datasets. Related to this: LM classes are never passed or provided arguments based on TaskConfig classes, as TaskConfigs are only used for task / dataset related arguments, not model-related arguments.
In the config:
- We can add trust_remote_code to dataset_kwargs, which will get passed to load_dataset(), so we don't need a whole extra field!

dataset_kwargs:
  trust_remote_code: true
This should behave as anticipated, I think, as well in the absence of trust_remote_code in the config--it should then work based on the user's HF_DATASETS_TRUST_REMOTE_CODE.
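For reference, a sketch of how I'd expect dataset_kwargs from the task YAML to reach load_dataset in the task's download step (details of the real method may differ):

```python
import datasets

def download(self, dataset_kwargs=None):
    # Sketch of a task's download step: any dataset_kwargs from the task YAML
    # (e.g. {"trust_remote_code": True}) are forwarded to load_dataset.
    self.dataset = datasets.load_dataset(
        path=self.DATASET_PATH,
        name=self.DATASET_NAME,
        **(dataset_kwargs or {}),
    )
```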
In the Model:
I think it's okay to preserve the current behavior, at least as one option (allowing --model_args trust_remote_code=True/False). I think if we do add a --trust_remote_code CLI flag for HF datasets, it'd make sense for the CLI flag turning on trust_remote_code to also override and pass trust_remote_code=True to the model constructors, though.
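Something along these lines is what I'd picture for that flag (hypothetical sketch, argument handling is illustrative):

```python
import argparse
import datasets

parser = argparse.ArgumentParser()
parser.add_argument("--model_args", default="")
parser.add_argument("--trust_remote_code", action="store_true")
args = parser.parse_args()

if args.trust_remote_code:
    # Let HF datasets load script-based datasets for this run...
    datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
    # ...and forward the same choice to the model constructor via model_args.
    args.model_args = (args.model_args + "," if args.model_args else "") + "trust_remote_code=True"
```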
Hi! FYI, the datasets release that requires explicit trust_remote_code=True for datasets with a Python loading script will be out today. Feel free to ping me if you encounter any issues with your CI or from users!