
[FEATURE REQUEST] Add a test that verifies the md5 value of each npz file.

Open jiaqima opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

Since we plan to include the md5 value in the file name of each npz file, we can add a test that verifies each file against it. This will help guard against malicious attacks in which someone uploads a file with the same name as an existing file but with different content.

Describe the solution you'd like

The test can be added at two levels.

  1. The pytest level. Whenever a new PR changes a dataset folder, we can verify all the npz files listed in the metadata.json and task json files.
  2. The dataloader level. We could also add an assertion in the data-loading functions (maybe only when the files are downloaded). But this would break the current datasets that use the old file naming, so it can only be done after all the datasets are updated. It would also add a little computational overhead, which I'm not sure is worth it.

Level 1 is redundant if level 2 is implemented.
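
A minimal sketch of the check both levels would share. The file naming format is not specified in this issue, so the `<name>__<md5>.npz` pattern is only an assumption for illustration, and `file_md5` and `verify_npz_md5` are hypothetical helper names, not existing functions in the repository.

```python
import hashlib
import re
from pathlib import Path

# Assumed (hypothetical) naming convention: "<name>__<32 hex chars>.npz".
# Adjust the pattern to whatever format the project settles on.
_MD5_IN_NAME = re.compile(r"__([0-9a-f]{32})\.npz$")


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_npz_md5(path):
    """Return True if the MD5 embedded in the file name matches the content."""
    match = _MD5_IN_NAME.search(Path(path).name)
    if match is None:
        raise ValueError(f"no md5 found in file name: {path}")
    return file_md5(path) == match.group(1)
```

At the pytest level, a test could call `assert verify_npz_md5(path)` for every npz file listed in the json files. At the dataloader level, the same call could run once right after a download, which would keep the overhead off files that are already cached.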

jiaqima · Mar 30 '23 15:03