ExplainaBoard
Add customized feature function
This PR makes feature functions customizable, either through build-in definitions (stored in dataset_custom_features.json) or build-out definitions (passed to the loader in code). For example,
(1) Build-out
loader = get_loader_class(TaskType.text_classification).from_datalab(
    dataset=DatalabLoaderOption(
        "sst2",
        custom_features={
            "long_text_50": {
                "dtype": "string",
                "description": "whether a text is long",
                "num_buckets": 2,
                "func": "lambda x:'Long Text' if "
                "len(x['text'].split()) > 50 "
                "else 'Short Text'",
            }
        },
    ),
    output_data=os.path.join(self.artifact_path, "output_sst2.txt"),
    output_source=Source.local_filesystem,
    output_file_type=FileType.text,
)
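To make the "func" string above concrete, here is a minimal sketch of the same logic written as an ordinary Python function and applied to two samples. The sample dictionaries and the function name are illustrative assumptions; only the `text` field and the 50-token threshold come from the example above.

```python
def long_text_50(example: dict) -> str:
    """Same logic as the "func" string: bucket by whitespace token count."""
    return "Long Text" if len(example["text"].split()) > 50 else "Short Text"


# Hypothetical samples; real SST-2 examples would come from DataLab.
short_example = {"text": "a gripping and funny film"}
long_example = {"text": " ".join(["word"] * 51)}

print(long_text_50(short_example))  # Short Text
print(long_text_50(long_example))   # Long Text
```

With `num_buckets` set to 2, the two returned strings would presumably serve as the two bucket labels.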
(2) Build-in (dataset_custom_features.json)
{
    "sst2": {
        "label": {
            "dtype": "string",
            "description": "the true label",
            "num_buckets": 2
        },
        "text_len": {
            "dtype": "float",
            "description": "text length",
            "num_buckets": 4,
            "func": "lambda x:len(x['text'].split())"
        },
        "long_text": {
            "dtype": "string",
            "description": "whether a text is long",
            "num_buckets": 2,
            "func": "lambda x:'Long Text' if len(x['text'].split()) > 20 else 'Short Text'"
        }
    }
}
Caveats:
- Some modifications affect downstream applications: the type of custom_features changes from list to dict.
- Only named datasets (datasets in datalab) are supported.
- The feature function is not able to take system predictions into account. For example, we cannot define a feature based on the length of generated summaries.
Some interesting things we can do in the future:
- calculate more features for named datasets and store them using DataLab SDK
  - maybe it could be in private repo/db
- allow users to customize feature functions on the fly in explainaboard web
  - for each dataset, we need to maintain and display its column information (i.e., dataset features in datalab SDK)
- so far, the customized feature functions are not able to be applied to system predictions, which should be supported later
I think this is a great idea, but would it be OK to defer this for a little bit for two reasons?
- I'd like to try to merge in version 0.11 as soon as possible to prevent it from diverging too much, and this hits some core functionality that version 0.11 is also trying to handle.
- I'd like to think seriously about whether there are security implications of this, as if this is used carelessly it could result in arbitrary code being specified in strings and being run.
My goal is to finish upgrading the main branch of ExplainaBoard to v0.11 by the end of this week, so would it be OK to revisit this then?
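A minimal sketch of the concern, assuming the "func" string is turned into a callable with Python's built-in `eval` (an assumption; the PR description does not show how the string is executed). The string below is hypothetical and harmless, but it shows that such a string can reach any module, not just the fields of the example it receives.

```python
import os

# Hypothetical "func" string as it might appear in a config file. Instead of
# computing a text feature, it imports a module and calls into the OS.
untrusted_func = "lambda x: __import__('os').getcwd()"

# Assumption: the string is converted to a callable with eval().
feature_fn = eval(untrusted_func)

# The "feature value" is the process's working directory, which shows that
# the string ran code unrelated to the example it was given.
leaked = feature_fn({"text": "any input"})
print(leaked == os.getcwd())  # True
```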
Yeah, I think that's fine for me.
"I'd like to think seriously about whether there are security implications of this, as if this is used carelessly it could result in arbitrary code being specified in strings and being run."
Hi @neubig, I just realized that if we only have this functionality in ExplainaBoard and don't create a specific interface for it in explainaboard web, then there will be little security problem.
Hey @pfliu-nlp, now that we've merged the code I mentioned above I think we can come back to this! One thing that will need to be done is a merge of the current main branch of course.
"if we only have this functionality in ExplainaBoard and don't create a specific interface for it in explainaboard web, then there will be little security problem"
I somewhat agree, but we need to be certain not only that this won't happen now, but also that it won't happen in the future. For example, I believe it is currently possible to specify custom features through metadata in the system output file, which could be uploaded to the web interface, so these custom feature functions could probably also be specified through that file, right? Also, we might add new options for specifying custom features in the future, and if the people implementing that functionality are not aware of this potential security risk, they might accidentally expose this interface.
Two safer options would be:
- Never parse a string to code and execute it. This would allow new functions to be specified programmatically, but not through configuration files.
- Do sandboxing of any parsed strings, which is possible but quite tricky to get right: https://stackoverflow.com/questions/3068139/how-can-i-sandbox-python-in-pure-python
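The first option can be sketched as a registry of named Python functions: a configuration entry refers to a function by name, and only functions registered in code can run. The names `FEATURE_FUNCS`, `register_feature`, `resolve_feature`, and the `func_name` key are hypothetical and not part of ExplainaBoard.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a feature-function name to a Python callable.
FEATURE_FUNCS: Dict[str, Callable[[dict], object]] = {}


def register_feature(name: str):
    """Decorator that registers a feature function under a fixed name."""
    def decorator(func):
        FEATURE_FUNCS[name] = func
        return func
    return decorator


@register_feature("text_len")
def text_len(example: dict) -> int:
    return len(example["text"].split())


@register_feature("long_text")
def long_text(example: dict) -> str:
    return "Long Text" if len(example["text"].split()) > 20 else "Short Text"


def resolve_feature(config: dict) -> Callable[[dict], object]:
    """Look up a function by name; strings are never evaluated as code."""
    name = config["func_name"]
    if name not in FEATURE_FUNCS:
        raise KeyError(f"unknown feature function: {name!r}")
    return FEATURE_FUNCS[name]


# A config entry now carries a name ("func_name") instead of source code.
config = {
    "dtype": "float",
    "description": "text length",
    "num_buckets": 4,
    "func_name": "text_len",
}
fn = resolve_feature(config)
print(fn({"text": "a short review"}))  # 3
```

With this design, a string such as "lambda x: ..." in a config file is just an unknown name and raises `KeyError`. The trade-off is that users can no longer define new functions purely through configuration files.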