metaflow
Support for another public cloud - Microsoft Azure
Currently, Metaflow is set up to work with AWS as the default public cloud. The architecture of Metaflow allows for additional public clouds to be supported.
Adding support for Microsoft Azure might broaden the potential user base, which could increase the adoption rate. This, in turn, could lead to increased community attention.
I can dedicate a few hours here and there for Azure support, but I don't have time to take the reins on this one. If someone goes through the trouble of designing and proposing a solution and could use some extra hands for the implementation, loop me in.
@gerryhernandez: I have asked my contacts at Microsoft (Norwegian HQ) whether they would be willing to pitch in with funding and/or time from their engineers.
Yes, any update from Microsoft? If they don't have a plan to do so, can we fork a branch and add Azure enhancements on our own?
Hello! I'm in discussion with my colleagues from Microsoft Norway about this project. @jwang01 do you want to help with implementing?
What's the main challenge you can see now? Converting the AWS Cloudformation templates to ARM templates?
That might be a good start :)
I wonder if Kubernetes/Helm would be a better option than ARM? The result would then potentially be cloud-agnostic.
Any chance of this getting traction?
Ditto. Any luck?
Also curious about this feature. Any updates, @webmaxru or @gerryhernandez?
@pikulmar With the new datastore implementation (#580), it should now be rather straightforward to integrate with Azure Blob Storage. With the Kubernetes support for compute and orchestration (#644), one can reliably run workloads on AKS. Let us know if you would like to help test out #644!
@savingoyal Yes, definitely! I will give it a try and let you know how things go.
@savingoyal Ok, I had a first go at this:

- Submitting Kubernetes jobs to AKS using the branch from #644 works. As expected, jobs fail to complete because they cannot access the code package in S3. So, the name of the game is to use #580 in order to add Azure Blob Storage support.
- In order to bring #580 in, I merged current `master` into `plugin-linter` (the target branch of #644), yielding https://github.com/fortum-tech/metaflow/tree/plugin-linter-update. I subsequently merged #644 into the former branch, yielding https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s. Finally, I implemented `DataStoreStorage` using `cloudpathlib`. The result can be found in https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s-cloudpathlib. This yielded a first successful step execution on AKS. However, there is still work left to do. I tried to summarize the open issues in https://github.com/fortum-tech/metaflow/blob/plugin-linter-update-k8s-cloudpathlib/metaflow/datastore/cloudpathlib_storage.py#L13.
I could use some input as to how to proceed. I see two paths:

- Open a PR to merge https://github.com/fortum-tech/metaflow/tree/plugin-linter-update into `master`. Then, merge https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s into `master` (e.g. via #644, after retargeting). These branches contain merge commits; let me know if you consider this a problem.
- Rebase #644 on top of current `master` and proceed from there.

In either case, if you think the approach is promising, we could consider opening a (separate) PR for the `cloudpathlib` data store feature.

Let me know what you think.
UPDATE: Using the latest https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s-cloudpathlib, the `@conda` decorator also works as expected. As already noted, there is some redundancy between `metaflow.datatools.S3` and `DataStoreStorage`. Getting `@conda` to work on AKS only required using `DataStoreStorage`. Perhaps `metaflow.datatools.S3` could be removed completely at some point? Alternatively, one could replace `metaflow.datatools.S3` with a `metaflow.datatools.DataStoreTools` that implements (some of) the existing `metaflow.datatools.S3` API in a cloud-agnostic manner, based on `DataStoreStorage`?
@pikulmar

- Yes, we can get rid of `metaflow.datatools.S3` from the `@conda` implementation entirely - if you would like to submit a PR, please let me know!
- PR #580 and #644 have been merged into the Metaflow codebase.
- I would be happy to take a look at your Azure datastore PR.
- I am not very familiar with `cloudpathlib` - is there a specific reason to opt for it in lieu of Azure's Python SDK?
Any update on supporting MSFT Azure?
I posted an invitation on my social media. Please amplify it to reach potential contributors:
https://www.linkedin.com/posts/webmax_support-for-another-public-cloud-microsoft-activity-6869342352914309120-V4U8
https://twitter.com/webmaxru/status/1463576009781956616?s=20
@savingoyal It appears that we might not require Azure Blob Storage support in Metaflow after all (we might decide to share details on this later), which is why I am not sure how much time we would be able to dedicate to a corresponding PR at this time.
Regarding your question, there is a trade-off between versatility and performance:
- Using `cloudpathlib` makes implementing the `Datastore` API particularly straightforward and has the advantage of supporting all cloud storage services supported by `cloudpathlib`, including those added in the future.
- Using cloud-specific libraries like `boto3`, on the other hand, can provide improved performance by using parallel data transfers etc. (I am not aware of `cloudpathlib` supporting that yet). But, of course, it requires work for each additional cloud storage service to be supported.
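The performance side of that trade-off can be sketched generically: blob fetches are I/O-bound, so even a backend without SDK-managed parallel transfers (such as those `boto3` provides) can approximate the benefit with a thread pool. A minimal sketch, using a hypothetical dict-backed stand-in for a blob store:

```python
from concurrent.futures import ThreadPoolExecutor


class DictStorage:
    """Hypothetical in-memory stand-in for a blob-store backend,
    exposing only load_bytes; illustrative, not a Metaflow class."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)

    def load_bytes(self, key):
        return self.blobs[key]


def load_many(storage, keys, max_workers=8):
    # Fetch several blobs concurrently. For real cloud backends each
    # call is dominated by network latency, so threads overlap the
    # round-trips; results come back in the order of `keys`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(storage.load_bytes, keys))
```

A cloud-specific SDK can go further (multipart transfers, tuned chunk sizes), which is where the per-provider implementation effort comes in.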
Generally, development could be split into multiple PRs:

1. Refactor `metaflow.datatools.S3` (and perhaps other parts of the code) to exclusively rely on the new `Datastore` abstraction.
2. a. Implement a `cloudpathlib`-based `Datastore` similar to what was linked/discussed above.
   b. Add an Azure-specific, performance-optimized `Datastore` implementation if this is of interest.

Question: Can `Datastore` implementations also be managed as plug-ins (via the `metaflow_extensions` mechanism) and, if yes, would such an approach be preferred for contributions 2a and/or 2b?
A few notes:

- Currently a datastore cannot be added via the `metaflow_extensions` mechanism, but that should be possible very shortly (it's very trivial to do and just requires a tiny code reorg -- it's planned, but I didn't do it yet given all the other changes that were in flight).
- I would rather keep `metaflow.datatools.S3` as is; currently the S3 storage implementation for the datastore relies on it, so it would be a bit circular if we made it depend on the datastore. Everything else should be going through the datastore. It's a good question whether or not we keep it as part of core Metaflow, though, or move it to a more S3-specific portion.
- Adding an implementation with `cloudpathlib` should hopefully be very easy; you just have to implement the storage part, which has just a few methods. Similar for the Azure-specific one as well. Everything in the datastore boils down to those functions. There is a requirement to store "metadata" about the file, but that can be stored as a separate blob, as in the local implementation.
This work is currently in-flight. We expect a feature-complete PR to be available over the next couple of weeks. It will cover AzureBlobStorage as our Azure Datastore and run on top of Kubernetes (AKS or BYOC).
Wonderful! Would you mind letting me know when this is done and testable? Thx!!!
@Dana-Farber The PR is now available for testing.