armory
armory copied to clipboard
Refactor datasets
Edited on 11 March 2022
GOAL: Currently armory uses TFDS 3.2.0 which is outdated and makes some tasks (e.g. including local datasets) difficult. This ticket aims to update TFDS to version 4 and incorporate all the necessary downstream changes into armory. NOTE: TFDS4 has some breaking changes that will need to be dealt with.
TASKS:
As a part of this ticket, we would also like to clean up armory.data
as much as we can and make it more clear as to what datasets armory "supports". The following tasks are considered the completion criteria for this ticket:
- [ ] Consider creating and
armory dataset builder
function and split it out from the main armory codebase. NOTE: there is already some tooling here (integrate_tfds.py
,template_boilerplate.py
) but we need to separate it out from everything else. Also worth noting that the TFDS4 cli is pretty nice and may obviate the need for us to build that tooling into armory. - [ ] Refactor
armory.data.datasets
andarmory.data.adversarial_datasets
to make it more streamlined. Maybe consider havingdatasets
return something like(x, None), y
so that the structure works the same? Also, is there ever a case where for adversarial things something like(x, xadv, (y, yadv)
would be appropriate? - [ ] As a part of this cleanup effort, create an
armory.download
module and movedata.model_weights.py
anddata.progress_precentage.py
into that module
Breaking changes are listed here: https://github.com/tensorflow/datasets/releases/tag/v4.0.0
@davidslater, going to be poking on this this weekend and next week... as I go, I would like to add more testing around some of this tooling but wanted to make sure I wasn't missing anything before I make changes.
Some Notes/Questions:
- Is there a reason why we have
adversarial_datasets.py
split out fromdatasets
? - Is there a reason to NOT have a tfds folder structure in the repo for "all" supported datasets? for example I know mnist is supported by tfds by default so we dont need to make it locally, but why not do it just to be more explicit? Or maybe move all the datasets that we have built to a folder called
additional_datasets
and then we can glob that directory for all datasets there to populated a supported list. IF we did that, maybe then it makes sense to have a directory calledadversarial_datasets
that is at same level - why do we have
integrate_tfds.py
in thearmory.data
folder as it appears to be a standalone script... maybe consider making ascripts
folder that we can put these types of things into? - why is
model_weights.py
inarmory.data
? consider moving this type of this toarmory.download
module? That would make it a bit easier to track as we move to anoffline
mode. - same as above for
progress_percentage.py
? - It appears that
template_boilerplate.py
is only used inintegrate_tfds.py
can we just move it there??
Some Notes/Questions:
- Is there a reason why we have
adversarial_datasets.py
split out fromdatasets
? Yes, adversarial_datasets typically return something like(x, x_adv), y
where standard datasets just returnx, y
.
- Is there a reason to NOT have a tfds folder structure in the repo for "all" supported datasets? for example I know mnist is supported by tfds by default so we dont need to make it locally, but why not do it just to be more explicit? Or maybe move all the datasets that we have built to a folder called
additional_datasets
and then we can glob that directory for all datasets there to populated a supported list. IF we did that, maybe then it makes sense to have a directory calledadversarial_datasets
that is at same level Ideally, I would like to support ALL of tfds standard datasets, and just have our additions here. I like the idea of having two big folders: one for each of the adversarial and standard datasets instead of having a separate folder at the base level for each dataset. Structurally, though, I think that we should consider basing our structure off of what is generated by: https://www.tensorflow.org/datasets/cli
However, simply globbing a directory for supported datasets is probably not going to work well, as TFDS requires registration of those datasets, which in turn involves importing the files. We may be able to glob the datasets directory to search for files, import them to register them with TFDS, and then use that populate the list.
I would also like to be clear what I mean as a "supported" dataset. Maybe we should reword that. We should add features to enable supporting user-generated datasets without modifications to the armory code base (or s3 buckets we control). The current set of "supported" datasets are really ones that we use directly in our experiments and have cached versions of them for easy download/use.
- why do we have
integrate_tfds.py
in thearmory.data
folder as it appears to be a standalone script... maybe consider making ascripts
folder that we can put these types of things into? We should look into having a dataset "builder" function that is separate from the main armory. This could go in that or a separate scripts folder. Some of this may become unnecessary if we start using the TFDS v4 CLI.
- why is
model_weights.py
inarmory.data
? consider moving this type of this toarmory.download
module? That would make it a bit easier to track as we move to anoffline
mode.- same as above for
progress_percentage.py
? Yes to both.
- It appears that
template_boilerplate.py
is only used inintegrate_tfds.py
can we just move it there?? Yep. This will likely become deprecated when we revampdatasets.py
.
Thanks for the feedback. I am going to update the description above based on this feedback.
Some follow on thoughts/questions:
- For the datasets vs adv_datasets, could we just move to an output format all around of something like
(x, xadv), (y, yadv)
and then if they are non-adversarial it would return(x, None), (y, None)
? - The thought about dataset folder structure is exactly what I was thinking as well...I will look at tfds cli and play around a little bit.
- Understood about "support" of local datasets....I will give that some thought
I wouldn't want to go to an output format like that as adversarial datasets are not commonly used, and that format would be weird in all other contexts.
I wouldn't want to go to an output format like that as adversarial datasets are not commonly used, and that format would be weird in all other contexts.
What about something like x, y, *otherbits
so then all datasets would have "same" format, and then if you want/need other bits you can always grab it.
No, I would like to keep the base datasets as standard as possible, which is to make them return (x, y)
and only have the adversarial ones do something different.
You can fit the adversarial ones into the (x, y)
format - they can just have more structure, like ((x_ben, x_adv), (y_true, y_target))
Since TFDS v4 is merged in, I wanted to update the name