Define what to do with checksums in new architecture
For TFDS4 we are more tightly controlling the dataset generation process: what is provided to the armory-engine is a directory tree containing tfrecord files and the corresponding metadata JSON files. Also, dataset construction now lives in `armory.tools.dataset_builder` and is therefore decoupled from the main `armory.data` module. Finally, since we are using the `tfds.core.builder_from_directory` method, the "need" for the checksums is largely removed (at least from the perspective of TFDS).

In this new paradigm, the question becomes: what do we want to do with checksums? I imagine there is still some desire to verify checksums when armory downloads the `.tar.gz` files from armory s3, but is there anything else?
Initial Proposal:
- Move checksum files to the armory s3 bucket so that armory can pull down the necessary checksum when downloading the dataset
- Refactor `armory.data.datasets` to:
  1. Attempt to load the dataset from a local directory (containing tfrecord and metadata `*.json` files)
  2. If not there, attempt to unpack from the local cache and then load
  3. If not there, attempt to download the `cache.tar.gz` file from armory s3, check the checksum, unpack, and load
  4. If not there, error out
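The four-step cascade could be sketched roughly as below. This is only an illustration of the proposed control flow, not armory's actual API: `resolve_dataset_dir`, the `fetch_archive` hook, and the use of SHA-256 are all assumptions made here for concreteness.

```python
import hashlib
import tarfile
from pathlib import Path


def verify_checksum(path, expected_sha256, chunk_size=1 << 20):
    """Stream the file and compare its SHA-256 digest to the expected value."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256


def resolve_dataset_dir(name, data_dir, cache_dir, fetch_archive, expected_sha256):
    """Return the directory of tfrecords/metadata for `name`, materializing it
    from the local cache or s3 if needed.

    `fetch_archive(name, dest_path)` is a hypothetical hook that downloads the
    dataset's cache tarball from armory s3 to `dest_path`.
    """
    dataset_dir = Path(data_dir) / name
    if dataset_dir.is_dir():                        # 1. already unpacked locally
        return dataset_dir
    archive = Path(cache_dir) / f"{name}.tar.gz"
    if not archive.is_file():                       # 3. not cached: pull from s3
        fetch_archive(name, archive)
        if not verify_checksum(archive, expected_sha256):
            archive.unlink()
            raise ValueError(f"checksum mismatch for {archive}")
    with tarfile.open(archive, "r:gz") as tar:      # 2. unpack the cached tarball
        tar.extractall(Path(data_dir))
    if not dataset_dir.is_dir():                    # 4. nothing worked: error out
        raise FileNotFoundError(f"could not materialize dataset {name!r}")
    return dataset_dir
```

Note that in this sketch the checksum is only verified at download time (step 3), matching the proposal's framing that the checksum exists to validate the s3 transfer rather than the unpacked local data.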
@davidslater, thoughts??
Are you asking about `url_checksums` or `s3_checksums`? The latter, yes? The former are in the dataset-builder?
So, I don't think our new builder "requires" the `url_checksums`. For example, I built a clean docker image with just the requirements.txt and the dataset_builder code, and it was able to download and build both `mnist` and `digit` (just examples I tried) without any reference to `url_checksums`. Is there something I am missing?
Looking at the TFDS CLI, it creates a `checksums.tsv` file in the data directory, so maybe that is used for these instead of the `url_checksums`?
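If the `checksums.tsv` route pans out, consuming it would be straightforward. The sketch below assumes the file holds tab-separated url/size/sha256 rows; that layout, the sample row, and the `parse_checksums_tsv` helper are guesses for illustration and should be checked against what the TFDS CLI actually emits.

```python
import csv
import io

# Hypothetical row in the layout assumed here: url <TAB> size <TAB> sha256.
SAMPLE = "https://example.com/mnist.gz\t1648877\tdeadbeef\n"


def parse_checksums_tsv(text):
    """Map each url to its recorded size and checksum."""
    out = {}
    for url, size, sha256 in csv.reader(io.StringIO(text), delimiter="\t"):
        out[url] = {"size": int(size), "sha256": sha256}
    return out
```

Armory could then look up the expected checksum for a given url before (or after) downloading, instead of carrying a separate `url_checksums` directory.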