Handling sensitive data sent to remote services
With the introduction of metrics that can send data to remote services, we need a safe way to avoid accidentally sending proprietary/confidential data to external services.
In the common case in unitxt, metrics and datasets are developed by different people who may not be aware of each other or of each other's implementations, which makes this extremely error prone.
To address this, we need a way for:
- the dataset owner to specify the data classification for each dataset (or instance). The taxonomy should be defined by the user.
- the metric owner to define whether the metric is safe for all data (e.g. because it runs locally), or to let the user of the metric specify which data classifications may be used with the metric.
Suggested approach: Each loader will have an additional list[str] parameter called 'data_classification'. Different loaders can have different defaults. For example, LoadHF can set the default to "public", while another loader can set it to "proprietary". The user can override these for specific datasets, e.g. "PII".
loader=LoadFromIBMCloud(
    endpoint_url_env="MY_COS_URL",
    aws_access_key_id_env="MY_COS_ACCESS_KEY_ID",
    aws_secret_access_key_env="MY_COS_SECRET_ACCESS_KEY",
    bucket_name="...",
    data_dir="...",
    data_files=["train.jsonl", "test.jsonl"],
    data_classification=["proprietary", "pii"],
),
The loaders will add the list as a field to all the instances in the loaded datasets.
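A minimal sketch of that tagging step, to make the idea concrete. It is not taken from unitxt: the helper name add_data_classification and the plain-dict instances are assumptions made for illustration, and only the field name 'data_classification' comes from the proposal.

```python
from typing import Any, Dict, Iterable, Iterator, List


def add_data_classification(
    instances: Iterable[Dict[str, Any]],
    data_classification: List[str],
) -> Iterator[Dict[str, Any]]:
    """Yield a copy of each instance with the classification list attached.

    Hypothetical helper: a real loader would do this inside its own
    loading pipeline rather than through a free function.
    """
    for instance in instances:
        tagged = dict(instance)  # shallow copy, the input is left untouched
        tagged["data_classification"] = list(data_classification)
        yield tagged


rows = [{"text": "hello"}, {"text": "world"}]
tagged_rows = list(add_data_classification(rows, ["proprietary", "pii"]))
print(tagged_rows[0])
```

Carrying the list on every instance, rather than only on the dataset, means a metric can still check each instance after datasets with different classifications have been mixed.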
Each base metric class will verify in its compute() function that every instance's data classification is allowed, by calling check_allowed_data_classification(instance).
The default implementation of check_allowed_data_classification will read the list of allowed data classifications from a metric-specific environment variable.
If an instance's classification is not in that list, an error message of this type will be generated:
"The following instance has a data classification of '{instance_data_classification}', however the {metric} is only configured to support data with the classification '{allowed_data_classification}'. To allow this, set the environment variable {env_var} to include '{instance_data_classification}'."
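The check described above could look roughly like the sketch below. Several details are assumptions, not part of the proposal: the environment variable naming scheme (allowed_env_var), the comma-separated value format, the default_allowed argument, treating None as "safe for all data", and letting instances without a classification pass.

```python
import os
from typing import Any, Dict, List, Optional


def allowed_env_var(metric_name: str) -> str:
    # Hypothetical naming scheme for the metric-specific variable.
    return f"UNITXT_ALLOWED_DATA_CLASSIFICATION_{metric_name.upper()}"


def check_allowed_data_classification(
    instance: Dict[str, Any],
    metric_name: str,
    default_allowed: Optional[List[str]] = None,
) -> None:
    """Raise ValueError if the instance carries a classification that is not allowed."""
    env_var = allowed_env_var(metric_name)
    raw = os.environ.get(env_var)
    if raw is None:
        allowed = default_allowed
    else:
        # Assumed format: comma-separated list, e.g. "public,proprietary".
        allowed = [item.strip() for item in raw.split(",") if item.strip()]
    if allowed is None:
        return  # assumption: no list configured means the metric is safe for all data

    # Assumption: an instance without the field is treated as unclassified and passes.
    instance_classification = instance.get("data_classification", [])
    not_allowed = [c for c in instance_classification if c not in allowed]
    if not_allowed:
        raise ValueError(
            f"The following instance has a data classification of "
            f"'{instance_classification}', however the {metric_name} is only "
            f"configured to support data with the classification '{allowed}'. "
            f"To allow this, set the environment variable {env_var} "
            f"to include '{not_allowed}'."
        )


os.environ[allowed_env_var("remote_judge")] = "public,proprietary"
check_allowed_data_classification({"data_classification": ["public"]}, "remote_judge")
try:
    check_allowed_data_classification(
        {"data_classification": ["proprietary", "pii"]}, "remote_judge"
    )
except ValueError as error:
    print(error)
```

In this sketch a metric that calls a remote service would pass a restrictive default such as ["public"], while a local metric would leave default_allowed as None.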
@elronbandel @eladven @perlitz - Please review.
I agree @yoavkatz . This is a good solution.