pytorch-lightning
pytorch-lightning copied to clipboard
(app) Add s3 drive type (1/2)
What does this PR do?
~This PR adds a list of Drives to the app Work specification, this will enable apps to mount a given drive in the work and be able to access it through the file system (Using for example file system data loaders). This is an alternative way to work with drives than the existing access pattern and will later be expanded with additional options for optimizations.~
@rlizzo edits: This PR has been updated after feedback to basically only include the configuration for setting an s3 based Drive Type. A follow up PR will be made which allows us to inspect Drive
objects attacked to each work attribute and then configure them based on the protocol (s3, etc). At the advice of @tchaton we have have removed the ability to use s3 based drives from the Drive
get/list/put/delete
method. If we want to enable this in the future, we will have a discussion on the API with @awaelchli and other interested parties.
All the changes presented in this PR related to running in a cloud environment, when running locally, specifying work drives will not have any effect (However, you can mimic the cloud environment by placing of symlinking files to the same root folder).
Additionally, this PR updates the Drive spec:
- Adding a new protocol
s3://
allowing the creation of drives directly from S3 buckets. - ~Adding a new, optional
optimization
field allowing users to specify optimizations for Drives when running on the cloud environment.~
Does your PR introduce any breaking changes? If yes, please list them.
This PR should be fully backwards compatible with existing code and the Work.drives field is optional.
cc @borda
Please see new PR description for updated changes here. I've taken over work here for @panos-is now that he is on-call and we brought this PR in a slightly different direction.
I don't think we should use the protocol (s3:// vs lit://) as an indication on how the drive should be mounted. We should instead have an explicit flag for mounting, so we'll be able to mount non-s3:// locations (e.g. lit:// themselves, or GCP storage).
Eventually yes, but there's no particular reason to frontload this work here. Lets add these features when we get to them instead of adding a mount flag but only supporting lit+false
and s3+true
here. That's bound to be somewhat confusing to the users.
@lantiga I've been going back and forth here, but I think I agree with @panos-is. When we have additional drive types and options, it'll make sense to introduce more flags, but for right now the proposed solution is simple to explain/understand & doesn't require us to introduce breaking changes in the near future when we have a demonstrated need for additional arguments & behaviors.
My need is about the now though (while designing the training app): how can we mount a lit:// drive? Should we surface its underlying s3 bucket?
Here's the use case: I use a notebook to download a dataset from somewhere (not s3) and I use a lit:// Drive to store the data. Then I want to mount that data to a Work. How do I do that?
Same goes with: I have a data-generating process that puts data into a lit:// Drive. Then I want to mount the Drive into a Work to run a training job.
Let's try to figure out a way to enable this because it's going to be a pretty common use case once we roll out the app for training. I'm good with proceeding incrementally, but relying on protocol is guaranteed to not be future-proof in a short timespan.
Any feedback on the strategy to enable mounting lit://
drives RO from an API perspective?
A cheap way (in a future PR) would be in this case to snapshot a lit://
drive to an s3://
bucket and return the bucket location, so
- it becomes immutable, so no problem if the
lit://
drive keeps changing if another Work writes to it, and we can safely mount it RO - we have a
s3://
location we can use for mounting, aka we can rely on the protocol to mount it (although I 100% do not like this approach - ok I've made my point)
All the above can happen behind the scenes, with the obvious disadvantage that we duplicate data, but that can be a clear caveat.
WDYT @panos-is @rlizzo
@lantiga
Any feedback on the strategy to enable mounting
lit://
drives RO from an API perspective?
the API for mount: bool = True
argument sounds good to me.
A cheap way (in a future PR) would be in this case to snapshot a
lit://
drive to ans3://
bucket and return the bucket location, so
- it becomes immutable, so no problem if the
lit://
drive keeps changing if another Work writes to it, and we can safely mount it RO- we have a
s3://
location we can use for mounting, aka we can rely on the protocol to mount it (although I 100% do not like this approach - ok I've made my point)
I think this is a bit overkill. For lit://
drives we actually have the option of enabling real-time r+w
capabilities (with real-time updates) if we are willing to accept the performance hit of not using our proprietary backend code path (I'm being intentionally ambiguous since this is a public repo). This isn't much work to do at all (I'd estimate < 1 week for either @panos-is or myself).
We could also skip our proprietary code paths like above and solely enable ro
support (with real-time updates) The benefit of ro
mode is that it would prevent the Drive state transfers from getting out of sync with files written to the mount path.