armory icon indicating copy to clipboard operation
armory copied to clipboard

Refactor datasets

Open davidslater opened this issue 3 years ago • 8 comments

Edited on 11 March 2022

GOAL: Currently armory uses TFDS 3.2.0 which is outdated and makes some tasks (e.g. including local datasets) difficult. This ticket aims to update TFDS to version 4 and incorporate all the necessary downstream changes into armory. NOTE: TFDS4 has some breaking changes that will need to be dealt with.

TASKS: As a part of this ticket, we would also like to clean up armory.data as much as we can and make it more clear as to what datasets armory "supports". The following tasks are considered the completion criteria for this ticket:

  • [ ] Consider creating and armory dataset builder function and split it out from the main armory codebase. NOTE: there is already some tooling here (integrate_tfds.py, template_boilerplate.py) but we need to separate it out from everything else. Also worth noting that the TFDS4 cli is pretty nice and may obviate the need for us to build that tooling into armory.
  • [ ] Refactor armory.data.datasets and armory.data.adversarial_datasets to make it more streamlined. Maybe consider having datasets return something like (x, None), y so that the structure works the same? Also, is there ever a case where for adversarial things something like (x, xadv, (y, yadv) would be appropriate?
  • [ ] As a part of this cleanup effort, create an armory.download module and move data.model_weights.py and data.progress_precentage.py into that module

davidslater avatar Feb 15 '22 22:02 davidslater

Breaking changes are listed here: https://github.com/tensorflow/datasets/releases/tag/v4.0.0

davidslater avatar Feb 18 '22 23:02 davidslater

@davidslater, going to be poking on this this weekend and next week... as I go, I would like to add more testing around some of this tooling but wanted to make sure I wasn't missing anything before I make changes.

Some Notes/Questions:

  • Is there a reason why we have adversarial_datasets.py split out from datasets?
  • Is there a reason to NOT have a tfds folder structure in the repo for "all" supported datasets? for example I know mnist is supported by tfds by default so we dont need to make it locally, but why not do it just to be more explicit? Or maybe move all the datasets that we have built to a folder called additional_datasets and then we can glob that directory for all datasets there to populated a supported list. IF we did that, maybe then it makes sense to have a directory called adversarial_datasets that is at same level
  • why do we have integrate_tfds.py in the armory.data folder as it appears to be a standalone script... maybe consider making a scripts folder that we can put these types of things into?
  • why is model_weights.py in armory.data ? consider moving this type of this to armory.download module? That would make it a bit easier to track as we move to an offline mode.
  • same as above for progress_percentage.py?
  • It appears that template_boilerplate.py is only used in integrate_tfds.py can we just move it there??

shenshaw26 avatar Mar 10 '22 20:03 shenshaw26

Some Notes/Questions:

  • Is there a reason why we have adversarial_datasets.py split out from datasets? Yes, adversarial_datasets typically return something like (x, x_adv), y where standard datasets just return x, y.
  • Is there a reason to NOT have a tfds folder structure in the repo for "all" supported datasets? for example I know mnist is supported by tfds by default so we dont need to make it locally, but why not do it just to be more explicit? Or maybe move all the datasets that we have built to a folder called additional_datasets and then we can glob that directory for all datasets there to populated a supported list. IF we did that, maybe then it makes sense to have a directory called adversarial_datasets that is at same level Ideally, I would like to support ALL of tfds standard datasets, and just have our additions here. I like the idea of having two big folders: one for each of the adversarial and standard datasets instead of having a separate folder at the base level for each dataset. Structurally, though, I think that we should consider basing our structure off of what is generated by: https://www.tensorflow.org/datasets/cli

However, simply globbing a directory for supported datasets is probably not going to work well, as TFDS requires registration of those datasets, which in turn involves importing the files. We may be able to glob the datasets directory to search for files, import them to register them with TFDS, and then use that populate the list.

I would also like to be clear what I mean as a "supported" dataset. Maybe we should reword that. We should add features to enable supporting user-generated datasets without modifications to the armory code base (or s3 buckets we control). The current set of "supported" datasets are really ones that we use directly in our experiments and have cached versions of them for easy download/use.

  • why do we have integrate_tfds.py in the armory.data folder as it appears to be a standalone script... maybe consider making a scripts folder that we can put these types of things into? We should look into having a dataset "builder" function that is separate from the main armory. This could go in that or a separate scripts folder. Some of this may become unnecessary if we start using the TFDS v4 CLI.
  • why is model_weights.py in armory.data ? consider moving this type of this to armory.download module? That would make it a bit easier to track as we move to an offline mode.
  • same as above for progress_percentage.py? Yes to both.
  • It appears that template_boilerplate.py is only used in integrate_tfds.py can we just move it there?? Yep. This will likely become deprecated when we revamp datasets.py.

davidslater avatar Mar 10 '22 21:03 davidslater

Thanks for the feedback. I am going to update the description above based on this feedback.

Some follow on thoughts/questions:

  • For the datasets vs adv_datasets, could we just move to an output format all around of something like (x, xadv), (y, yadv) and then if they are non-adversarial it would return (x, None), (y, None)?
  • The thought about dataset folder structure is exactly what I was thinking as well...I will look at tfds cli and play around a little bit.
  • Understood about "support" of local datasets....I will give that some thought

shenshaw26 avatar Mar 11 '22 12:03 shenshaw26

I wouldn't want to go to an output format like that as adversarial datasets are not commonly used, and that format would be weird in all other contexts.

davidslater avatar Mar 11 '22 16:03 davidslater

I wouldn't want to go to an output format like that as adversarial datasets are not commonly used, and that format would be weird in all other contexts.

What about something like x, y, *otherbits so then all datasets would have "same" format, and then if you want/need other bits you can always grab it.

hensh2ss avatar Mar 14 '22 16:03 hensh2ss

No, I would like to keep the base datasets as standard as possible, which is to make them return (x, y) and only have the adversarial ones do something different.

You can fit the adversarial ones into the (x, y) format - they can just have more structure, like ((x_ben, x_adv), (y_true, y_target))

davidslater avatar Mar 14 '22 16:03 davidslater

Since TFDS v4 is merged in, I wanted to update the name

davidslater avatar Mar 18 '22 16:03 davidslater