
tfdsv4 resisc45

Open lcadalzo opened this issue 2 years ago • 7 comments

lcadalzo avatar Nov 30 '22 20:11 lcadalzo

@davidslater do you recall why the dataset gets built to <data_dir>/resisc45_split rather than <data_dir>/resisc45? This discrepancy makes load.load() think the dataset doesn't exist locally, since it's looking for the path <data_dir>/resisc45

build command:

python -m armory.datasets.build resisc45

lcadalzo avatar Nov 30 '22 20:11 lcadalzo
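The mismatch described above can be checked directly on disk. The following is a minimal sketch, not Armory's own lookup logic: it assumes a layout of `<data_dir>/<name>/<version>/dataset_info.json`, and `find_built_versions` is a hypothetical helper introduced only for illustration.

```python
import tempfile
from pathlib import Path


def find_built_versions(data_dir, name):
    """Return version directories under <data_dir>/<name> that contain dataset_info.json."""
    root = Path(data_dir) / name
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if (p / "dataset_info.json").is_file()
    )


# Demo: simulate a build that landed in "resisc45_split" instead of "resisc45".
with tempfile.TemporaryDirectory() as data_dir:
    version_dir = Path(data_dir) / "resisc45_split" / "3.0.0"
    version_dir.mkdir(parents=True)
    (version_dir / "dataset_info.json").write_text("{}")

    found_split = find_built_versions(data_dir, "resisc45_split")
    found_plain = find_built_versions(data_dir, "resisc45")

print(found_split, found_plain)
```

An empty list for the configured name alongside a non-empty list for another name points to a naming mismatch rather than a failed build.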

It was because the original dataset did not contain splits. However, I think that we're fine to just call it resisc45.

You will want to incorporate resisc45_dataset_partition.py into your _generate_examples method, I think.

davidslater avatar Nov 30 '22 21:11 davidslater

In armory/data/resisc45/resisc45_split.py, which I've copied to armory/datasets, the data is already split into 3 tar files. I'm able to build the data (with splits) fine without touching resisc45_dataset_partition.py, but the data is built to resisc45_split dir:

I have no name!@b9bf7b1fac00:/workspace$ ls /armory/datasets/new_builds/resisc45_split/3.0.0/
dataset_info.json  resisc45_split-test.tfrecord-00000-of-00001   resisc45_split-train.tfrecord-00002-of-00004
features.json      resisc45_split-train.tfrecord-00000-of-00004  resisc45_split-train.tfrecord-00003-of-00004
label.labels.txt   resisc45_split-train.tfrecord-00001-of-00004  resisc45_split-validation.tfrecord-00000-of-00001


I have no name!@b9bf7b1fac00:/workspace$ ls /armory/datasets/new_builds/resisc45             
ls: cannot access '/armory/datasets/new_builds/resisc45': No such file or directory

If I change the dataset "name" in the config from "resisc45" to "resisc45_split", I can load the data fine. But with the name "resisc45", load.load() looks for /armory/datasets/new_builds/resisc45 and throws an error.

lcadalzo avatar Nov 30 '22 21:11 lcadalzo
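One likely reason for the `resisc45_split` directory: TFDS derives a builder's default dataset name from its class name by converting CamelCase to snake_case, and the build directory follows that name. The sketch below approximates that conversion in plain Python rather than calling TFDS itself; the class names `Resisc45Split` and `Resisc45` are assumptions, since the thread only gives the module name.

```python
import re

# Approximation of the CamelCase -> snake_case rule applied to a builder
# class name to obtain the default dataset name (and hence the build directory).
_FIRST_CAP = re.compile(r"(.)([A-Z][a-z0-9]+)")
_ALL_CAP = re.compile(r"([a-z0-9])([A-Z])")


def default_dataset_name(class_name: str) -> str:
    """Convert a CamelCase class name to a snake_case dataset name."""
    partial = _FIRST_CAP.sub(r"\1_\2", class_name)
    return _ALL_CAP.sub(r"\1_\2", partial).lower()


# Class names are assumed for illustration only.
print(default_dataset_name("Resisc45Split"))  # resisc45_split
print(default_dataset_name("Resisc45"))       # resisc45
```

Under that assumption, renaming the builder class would be enough to make the build land in the directory that load.load() expects.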

Those URLs point to the output of applying the resisc45_dataset_partition.py script to the original dataset NWPU-RESISC45.tar.gz and then breaking the result into separate files.

I think that we probably want to just reference armory-public-data/resisc45/NWPU-RESISC45.tar.gz and incorporate the script into the builder loop. Thoughts?

I can take a stab at it if you'd like.

davidslater avatar Nov 30 '22 21:11 davidslater
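Folding the partition step into the builder could look roughly like the sketch below. It is not the contents of resisc45_dataset_partition.py: the function name, the `<root>/<label>/<file>` path layout, and the per-class counts are all assumptions chosen for illustration.

```python
from collections import defaultdict


def partition_by_class(paths, n_train=5, n_val=1):
    """Split paths into train/validation/test, per class, by sorted position.

    Assumes each path ends in '<label>/<filename>'. The first n_train files of
    each class go to train, the next n_val to validation, the rest to test.
    """
    by_label = defaultdict(list)
    for path in paths:
        label = path.split("/")[-2]
        by_label[label].append(path)

    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        files = sorted(by_label[label])  # sorting keeps the split deterministic
        splits["train"] += files[:n_train]
        splits["validation"] += files[n_train:n_train + n_val]
        splits["test"] += files[n_train + n_val:]
    return splits


# Illustrative file names only; real counts per class would be much larger.
example_paths = [
    f"NWPU-RESISC45/{label}/{label}_{i:03d}.jpg"
    for label in ("airplane", "beach")
    for i in range(1, 8)
]
example_splits = partition_by_class(example_paths)
print({name: len(files) for name, files in example_splits.items()})
```

A builder's `_generate_examples` could then yield only the paths belonging to the split it was asked for, which removes the need for three pre-split archives.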

After the above commit, I can build to the resisc45 dir without error, and the build no longer uses the 3 separate files. I'm not quite sure why I needed to hardcode NWPU-RESISC45 in a couple of places to get things working, though.

lcadalzo avatar Nov 30 '22 23:11 lcadalzo

Builds fine for me.

davidslater avatar Dec 02 '22 17:12 davidslater

Removing the WIP, I'm able to run a scenario and see expected benign/adv output. @davidslater ready for re-review

lcadalzo avatar Dec 02 '22 18:12 lcadalzo

Done. Calling add_to_cache() also attempts to upload to S3, and that step failed with an error for me, though only after cached_datasets.json had already been updated. Have you been using upload() successfully? I noticed that all the URLs in the JSON are null.

lcadalzo avatar Dec 02 '22 21:12 lcadalzo

Where did it error? Do you have ARMORY_PRIVATE_S3_ID and ARMORY_PRIVATE_S3_KEY?

davidslater avatar Dec 02 '22 21:12 davidslater

You can break down that operation with:

from armory.datasets import package, upload
package.package("resisc45")
package.update("resisc45")
package.verify("resisc45")
upload.upload("resisc45", public=True)

davidslater avatar Dec 02 '22 21:12 davidslater
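Before retrying the upload, it may help to confirm that the two credential variables asked about above are visible to the Python process. This is a minimal sketch using only the standard library; `missing_s3_credentials` is a hypothetical helper, not part of Armory.

```python
import os

# Variable names are taken from the question earlier in the thread.
REQUIRED_VARS = ("ARMORY_PRIVATE_S3_ID", "ARMORY_PRIVATE_S3_KEY")


def missing_s3_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_s3_credentials()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("Both S3 credential variables are set.")
```

If both variables are present and the upload still fails, running the package, update, verify, and upload steps one at a time should show which step raises the error.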