Define what to do with checksums in new architecture
For TFDS4 we are more tightly controlling the dataset generation process: what is provided to the armory-engine is a directory tree containing tfrecord files and the corresponding metadata JSON files. Also, dataset construction now lives in `armory.tools.dataset_builder` and is therefore decoupled from the main `armory.data` module. Finally, since we are using the `tfds.core.builder_from_directory` method, the "need" for the checksums is largely removed (at least from the perspective of TFDS).

In this new paradigm, the question becomes: what do we want to do with checksums? I imagine there is still some desire to verify checksums when armory downloads the `.tar.gz` files from armory s3, but is there anything else?
Initial Proposal:
- Move checksum files to the armory s3 bucket so that armory can pull down the necessary checksum when downloading the dataset
- Refactor `armory.data.datasets` to:
  1. Attempt to load the dataset from a local directory (containing tfrecord and metadata `*.json` files)
  2. If not there, attempt to unpack from the local cache and then load
  3. If not there, attempt to download the `cache.tar.gz` file from armory s3, check the checksum, unpack, and load
  4. If not there, error out
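The four-step cascade could be sketched roughly as below. This is only an illustration of the proposed control flow, not armory's actual API: `resolve_dataset_dir`, the `fetch_archive` hook, and the use of SHA-256 are all assumptions made here for concreteness.

```python
import hashlib
import tarfile
from pathlib import Path


def verify_checksum(path, expected_sha256, chunk_size=1 << 20):
    """Stream the file and compare its SHA-256 digest to the expected value."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256


def resolve_dataset_dir(name, data_dir, cache_dir, fetch_archive, expected_sha256):
    """Return the directory of tfrecords/metadata for `name`, materializing it
    from the local cache or s3 if needed.

    `fetch_archive(name, dest_path)` is a hypothetical hook that downloads the
    dataset's cache tarball from armory s3 to `dest_path`.
    """
    dataset_dir = Path(data_dir) / name
    if dataset_dir.is_dir():                        # 1. already unpacked locally
        return dataset_dir
    archive = Path(cache_dir) / f"{name}.tar.gz"
    if not archive.is_file():                       # 3. not cached: pull from s3
        fetch_archive(name, archive)
        if not verify_checksum(archive, expected_sha256):
            archive.unlink()
            raise ValueError(f"checksum mismatch for {archive}")
    with tarfile.open(archive, "r:gz") as tar:      # 2. unpack the cached tarball
        tar.extractall(Path(data_dir))
    if not dataset_dir.is_dir():                    # 4. nothing worked: error out
        raise FileNotFoundError(f"could not materialize dataset {name!r}")
    return dataset_dir
```

Note that in this sketch the checksum is only verified at download time (step 3), matching the proposal's framing that the checksum exists to validate the s3 transfer rather than the unpacked local data.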
@davidslater, thoughts??
Are you asking about `url_checksums` or `s3_checksums`? The latter, yes? The former are in the dataset-builder?
So, I don't think our new builder "requires" the `url_checksums`. For example, I built a clean docker image with just the requirements.txt and the dataset_builder code, and it was able to download and build both `mnist` and `digit` (just examples I tried) without any reference to `url_checksums`. Is there something I am missing?
Looking at the TFDS CLI, it creates a `checksums.tsv` file in the data directory, so maybe that is used for these instead of the `url_checksums`?
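If the `checksums.tsv` route pans out, consuming it would be straightforward. The sketch below assumes the file holds tab-separated url/size/sha256 rows; that layout, the sample row, and the `parse_checksums_tsv` helper are guesses for illustration and should be checked against what the TFDS CLI actually emits.

```python
import csv
import io

# Hypothetical row in the layout assumed here: url <TAB> size <TAB> sha256.
SAMPLE = "https://example.com/mnist.gz\t1648877\tdeadbeef\n"


def parse_checksums_tsv(text):
    """Map each url to its recorded size and checksum."""
    out = {}
    for url, size, sha256 in csv.reader(io.StringIO(text), delimiter="\t"):
        out[url] = {"size": int(size), "sha256": sha256}
    return out
```

Armory could then look up the expected checksum for a given url before (or after) downloading, instead of carrying a separate `url_checksums` directory.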