Add customized feature function

Open pfliu-nlp opened this issue 2 years ago • 3 comments

This PR aims to make feature functions customizable, through either built-in or build-out definitions. For example:

(1) Build-out
        loader = get_loader_class(TaskType.text_classification).from_datalab(
            dataset=DatalabLoaderOption(
                "sst2",
                custom_features={
                    "long_text_50": {
                        "dtype": "string",
                        "description": "whether a text is long",
                        "num_buckets": 2,
                        "func": "lambda x:'Long Text' if "
                        "len(x['text'].split()) > 50 "
                        "else 'Short Text'",
                    }
                },
            ),
            output_data=os.path.join(self.artifact_path, "output_sst2.txt"),
            output_source=Source.local_filesystem,
            output_file_type=FileType.text,
        )

(2) Built-in (dataset_custom_features.json)

{
  "sst2": {
    "label": {
      "dtype": "string",
      "description": "the true label",
      "num_buckets": 2
    },
    "text_len": {
      "dtype": "float",
      "description": "text length",
      "num_buckets": 4,
      "func": "lambda x:len(x['text'].split())"
    },
    "long_text": {
      "dtype": "string",
      "description": "whether a text is long",
      "num_buckets": 2,
      "func": "lambda x:'Long Text' if len(x['text'].split()) > 20 else 'Short Text'"
    }
  }
}
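
To illustrate how entries like these might be consumed, here is a minimal sketch that reads a config in the shape above, turns each "func" string into a callable, and applies it to one sample. The helper names (compile_feature_funcs, apply_features) and the use of eval are assumptions for illustration, not the PR's actual implementation; features without a "func" (such as "label") are assumed to be read directly from the sample.

```python
import json

# Same shape as dataset_custom_features.json above, inlined for the sketch.
CONFIG = json.loads("""
{
  "sst2": {
    "label": {"dtype": "string", "description": "the true label", "num_buckets": 2},
    "text_len": {
      "dtype": "float",
      "description": "text length",
      "num_buckets": 4,
      "func": "lambda x:len(x['text'].split())"
    },
    "long_text": {
      "dtype": "string",
      "description": "whether a text is long",
      "num_buckets": 2,
      "func": "lambda x:'Long Text' if len(x['text'].split()) > 20 else 'Short Text'"
    }
  }
}
""")


def compile_feature_funcs(dataset_config):
    """Hypothetical helper: map feature name -> callable for entries with a "func"."""
    funcs = {}
    for name, spec in dataset_config.items():
        if "func" in spec:
            # eval runs whatever code the string contains, so this is only
            # acceptable for trusted configuration.
            funcs[name] = eval(spec["func"])
    return funcs


def apply_features(sample, funcs):
    """Hypothetical helper: return a copy of the sample with computed features added."""
    enriched = dict(sample)
    for name, func in funcs.items():
        enriched[name] = func(sample)
    return enriched


funcs = compile_feature_funcs(CONFIG["sst2"])
sample = {"text": "a gripping and moving film", "label": "positive"}
enriched = apply_features(sample, funcs)
print(enriched["text_len"], enriched["long_text"])
```

The sample dict stands in for one row of the dataset; the real loader may pass a different structure to the feature function.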

Caveats:

  • This changes the type of custom_features from list to dict, which affects downstream applications.
  • Only named datasets (datasets in DataLab) are supported.
  • The feature function cannot take system predictions into account. For example, we cannot define a feature over the length of generated summaries.

Some interesting things we can do in the future:

  • calculate more features for named datasets and store them using DataLab SDK
    • maybe it could be in private repo/db
  • allow users to customize feature functions on the fly in explainaboard web
    • for each dataset, we need to maintain and display its column information (i.e., dataset features in datalab SDK)
  • so far, customized feature functions cannot be applied to system predictions; this should be supported later

pfliu-nlp, Jul 30 '22 22:07

I think this is a great idea, but would it be OK to defer this for a little bit for two reasons?

  1. I'd like to try to merge in version 0.11 as soon as possible to prevent it from diverging too much, and this hits some core functionality that version 0.11 is also trying to handle.
  2. I'd like to think seriously about whether there are security implications of this, as if this is used carelessly it could result in arbitrary code being specified in strings and being run.
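
To make the second concern concrete, here is a minimal sketch, assuming the "func" strings are compiled with Python's built-in eval (the thread does not show the actual mechanism). The payload is hypothetical and harmless; it only shows that a string in a config file can reach modules that have nothing to do with feature extraction.

```python
# A "func" value as it might arrive from an untrusted config or uploaded file.
untrusted_func = "lambda x: __import__('os').getcwd()"

# eval compiles the attacker-controlled string into a real function.
feature = eval(untrusted_func)

# The code runs as soon as the "feature" is computed for a sample.
leaked = feature({"text": "any sample"})

# The "feature value" is the process's working directory, not a text property.
print(type(leaked).__name__)
```

Replacing getcwd with any other call available to the process would work the same way, which is why the source of these strings matters.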

My goal is to finish upgrading the main branch of ExplainaBoard to v0.11 by the end of this week, so would it be OK to revisit this then?

neubig, Jul 31 '22 08:07

Yeah, I think that's fine for me.

pfliu-nlp, Aug 01 '22 23:08

"I'd like to think seriously about whether there are security implications of this, as if this is used carelessly it could result in arbitrary code being specified in strings and being run."

Hi @neubig, I just realized that if we only have this functionality in ExplainaBoard and don't create a specific interface for it in explainaboard web, then there will be little security risk.

pfliu-nlp, Aug 15 '22 00:08

Hey @pfliu-nlp, now that we've merged the code I mentioned above I think we can come back to this! One thing that will need to be done is a merge of the current main branch of course.

"if we only have this functionality in ExplainaBoard and don't create a specific interface for it in explainaboard web, then there will be little security risk"

I somewhat agree, but we need to be certain not only that this can't happen now but also that it won't happen in the future. For example, I believe it is currently possible to specify custom features through metadata in the system output file, which can be uploaded to the web interface, and I think these custom feature functions could probably also be specified through that file, right? Also, we might add new ways to specify custom features in the future, and if the people implementing them are not aware of this security risk, they might accidentally expose this interface.

Two safer options would be:

  1. Never parse a string to code and execute it. This would allow new functions to be specified programmatically, but not through configuration files.
  2. Sandbox any parsed strings, which is possible but quite tricky to get right: https://stackoverflow.com/questions/3068139/how-can-i-sandbox-python-in-pure-python
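
Option 1 could be sketched as a registry of ordinary Python callables that configuration refers to by name only. The names FEATURE_REGISTRY, register_feature, and resolve_feature are hypothetical and not part of ExplainaBoard; the point is that a config string is used only as a lookup key and is never evaluated.

```python
from typing import Any, Callable, Dict

FeatureFunc = Callable[[Dict[str, Any]], Any]

# Hypothetical registry: feature name -> callable defined in code.
FEATURE_REGISTRY: Dict[str, FeatureFunc] = {}


def register_feature(name: str) -> Callable[[FeatureFunc], FeatureFunc]:
    """Decorator that records a feature function under a fixed name."""
    def decorator(func: FeatureFunc) -> FeatureFunc:
        FEATURE_REGISTRY[name] = func
        return func
    return decorator


@register_feature("text_len")
def text_len(sample):
    return len(sample["text"].split())


@register_feature("long_text")
def long_text(sample):
    return "Long Text" if len(sample["text"].split()) > 20 else "Short Text"


def resolve_feature(name: str) -> FeatureFunc:
    """Look up a registered function; unknown names are rejected, never evaluated."""
    try:
        return FEATURE_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown feature function: {name!r}") from None


# In this sketch, "func" holds a registry key instead of source code.
config = {"text_len": {"dtype": "float", "num_buckets": 4, "func": "text_len"}}
func = resolve_feature(config["text_len"]["func"])
print(func({"text": "a short example"}))
```

With this design, a string such as "lambda x: ..." in a config or uploaded file fails the lookup instead of being executed, at the cost of requiring new feature functions to be added in code.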

neubig, Aug 18 '22 01:08