croissant
croissant copied to clipboard
Add extra safeguards for basic auth
As part of #245, basic auth support was added for the fetching of raw data files. During a review of using mlcroissant for data loading functionality, we realized that there's a data exfiltration vulnerability. The exploit could look something like this:
- Bad actor hosts a properly formatted Croissant JSON-LD. Maybe from a reputable-seeming location like a public GCS/S3 URL.
- The
distributionproperty in the JSON-LD containscontentUrlvalues that point to a server under their control. - Bad actor publishes an example notebook of how to use Croissant to load data. It would:
- Describe how to set the
CROISSANT_BASIC_AUTH_*values to access a private dataset onkaggle.com, where victims would provide their own username and API key. - Use some means to load the private Kaggle data--either via the load script or by doing something like
mlc.Dataset(private_kaggle_dataset_url).records(record_set=name_of_tabular_file)--which would use the Kaggle credentials in the requests for data. Note that this step isn't necessary for exfiltration, but establishes a reason for why the victim's credentials would be needed in the first place. - Without clearing environment variables, then load data from the from the bad actor's Croissant. Because of the presence of the same values in the environment variables, the victim's credentials would be sent to the server under the attacker's control. The attacker could store those credentials and use them at a later time. If the attacker's server returns data that allows the notebook to complete as expected, this could go undetected for a while, allowing the attacker to amass credentials for several victims.
- Kaggle is the example here, but this could be the case for other data providers that support basic auth as a means for retrieving data behind ACLs.
- Describe how to set the
Solutions:
- We could introduce a 3rd new environment variable to serve as a safeguard. Something like
CROISSANT_BASIC_AUTH_HOSTcould be used to ensure that requests for data are only permitted to use theCROISSANT_BASIC_AUTH_(USERNAME|PASSWORD)for URLs that match on that domain.- This is maybe preferred since we already have environment-based authentication and the implementation is simple
- Authentication values could be accepted at the point of loading data so that each instantiation/invocation supplies the values rather than reading from the environment.
- Aside from being more complicated, this approach is also slightly riskier: it would require changing more of the code (higher chance of regression) and because of the top-down approach of passing in values, there's a higher risk of missing a path/branch.