
Support for another public cloud - Microsoft Azure

Open leifericf opened this issue 5 years ago • 21 comments

Currently, Metaflow is set up to work with AWS as the default public cloud. The architecture of Metaflow allows for additional public clouds to be supported.

Adding support for Microsoft Azure might broaden the potential user base, which could increase the adoption rate. This, in turn, could lead to increased community attention.

leifericf avatar Dec 10 '19 12:12 leifericf

I can dedicate a few hours here and there for Azure support, but I don't have time to take the reins on this one. If someone goes through the trouble of designing and proposing a solution and could use some extra hands for the implementation, loop me in.

gplusplus314 avatar Dec 10 '19 14:12 gplusplus314

@gerryhernandez: I have asked my contacts at Microsoft (Norwegian HQ) whether they would be willing to pitch in with funding and/or time from their engineers.

leifericf avatar Dec 12 '19 09:12 leifericf

Yes, any update from Microsoft? If they don't have plans to do so, can we fork a branch and add Azure enhancements on our own?

jwang01 avatar Feb 24 '20 18:02 jwang01

Hello! I'm in discussion with my colleagues from Microsoft Norway about this project. @jwang01 do you want to help with implementing?

webmaxru avatar Feb 25 '20 15:02 webmaxru

What's the main challenge you can see now? Converting the AWS Cloudformation templates to ARM templates?

ylulloa avatar Feb 27 '20 18:02 ylulloa

That might be a good start :)

webmaxru avatar Feb 28 '20 10:02 webmaxru

I wonder if Kubernetes/Helm would be a better option than ARM? The result would then potentially be cloud-agnostic.

nabsul avatar Feb 28 '20 16:02 nabsul

Any chance of this getting traction?

vermaakarsh avatar Jul 21 '20 09:07 vermaakarsh

> Any chances of this getting traction

Ditto. Any luck?

onacrame avatar Aug 11 '21 06:08 onacrame

Also curious about this feature. Any updates, @webmaxru or @gerryhernandez?

pikulmar avatar Oct 04 '21 14:10 pikulmar

@pikulmar With the new datastore implementation (#580), it should now be rather straightforward to integrate with Azure Blob Storage. With the Kubernetes support for compute and orchestration (#644), one can reliably run workloads on AKS. Let us know if you would like to help test out #644!

savingoyal avatar Oct 04 '21 14:10 savingoyal

> @pikulmar With the new datastore implementation (#580), it should be now rather straightforward to integrate with Azure Blob Store. With the kubernetes support for compute and orchestration (#644), one can reliably run workloads on AKS. Let us know if you would like to help test out #644!

@savingoyal Yes, definitely! I will give it a try and let you know how things go.

pikulmar avatar Oct 12 '21 06:10 pikulmar

> @pikulmar With the new datastore implementation (#580), it should be now rather straightforward to integrate with Azure Blob Store. With the kubernetes support for compute and orchestration (#644), one can reliably run workloads on AKS. Let us know if you would like to help test out #644!
>
> @savingoyal Yes, definitely! I will give it a try and let you know how things go.

@savingoyal Ok, I had a first go at this:

  1. Submitting Kubernetes jobs to AKS using the branch from #644 works. As expected, jobs fail to complete because they cannot access the code package in S3. So, the name of the game is to use #580 in order to add Azure Blob Storage support.

  2. In order to bring #580 in, I merged current master into plugin-linter (target branch of #644), yielding https://github.com/fortum-tech/metaflow/tree/plugin-linter-update. I subsequently merged #644 into the former branch, yielding https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s. Finally, I implemented DataStoreStorage using cloudpathlib. The result can be found in https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s-cloudpathlib. This yielded a first successful step execution on AKS. However, there is still work left to do. I tried to summarize the open issues in https://github.com/fortum-tech/metaflow/blob/plugin-linter-update-k8s-cloudpathlib/metaflow/datastore/cloudpathlib_storage.py#L13.
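For readers following along: the appeal of cloudpathlib is that its CloudPath mirrors the pathlib.Path API, so one storage implementation can target any supported cloud. A minimal sketch of that idea (SimpleStorage is an illustrative name, not Metaflow's actual DataStoreStorage interface; a local pathlib.Path stands in for a CloudPath such as az://container/store to keep the example self-contained):

```python
import tempfile
from pathlib import Path  # cloudpathlib's CloudPath exposes the same interface

class SimpleStorage:
    """Toy storage backend over any pathlib-style root: a local Path
    here, or a cloudpathlib CloudPath for az://, s3://, or gs:// roots."""

    def __init__(self, root):
        self.root = root

    def save_bytes(self, key, data):
        path = self.root / key
        # Directory creation; on blob stores, "directories" are implicit.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def load_bytes(self, key):
        return (self.root / key).read_bytes()

    def exists(self, key):
        return (self.root / key).exists()

store = SimpleStorage(Path(tempfile.mkdtemp()))
store.save_bytes("artifacts/x", b"hello")
print(store.load_bytes("artifacts/x"))  # prints b'hello'
```

Swapping the root for a CloudPath would, in principle, be the only change needed to point the same code at Azure Blob Storage.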

I could use some input as to how to proceed. I see two paths:

  • Open a PR to merge https://github.com/fortum-tech/metaflow/tree/plugin-linter-update into master. Then, merge https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s into master (e.g. via #644, after retargeting). These branches contain merge commits; let me know if you consider this a problem.

  • Rebase #644 on top of current master and proceed from there.

In either case, if you think the approach is promising, we could consider opening a (separate) PR for the cloudpathlib data store feature.

Let me know what you think.

UPDATE: Using the latest https://github.com/fortum-tech/metaflow/tree/plugin-linter-update-k8s-cloudpathlib, the @conda decorator also works as expected. As already noted, there is some redundancy between metaflow.datatools.S3 and DataStoreStorage. Getting @conda to work on AKS only required using DataStoreStorage. Perhaps metaflow.datatools.S3 could be removed completely at some point? Alternatively, one could replace metaflow.datatools.S3 by metaflow.datatools.DataStoreTools which implements (some of) the existing metaflow.datatools.S3 API in a cloud-agnostic manner, based on DataStoreStorage?
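The DataStoreTools idea floated above could look something like this sketch (the class name comes from the comment itself; the method set and the pathlib-style backend are illustrative assumptions, not Metaflow's actual API):

```python
import tempfile
from pathlib import Path

class DataStoreTools:
    """Hypothetical cloud-agnostic counterpart to metaflow.datatools.S3:
    the same flavor of get/put helpers, but delegating to any
    pathlib-style root (a local Path here; a cloudpathlib CloudPath
    would cover az://, s3://, or gs:// roots)."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        target = self.root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()

    def get_many(self, keys):
        return [(key, self.get(key)) for key in keys]

tools = DataStoreTools(Path(tempfile.mkdtemp()))
tools.put("pkg/code.tar", b"\x00tarball")
print(tools.get("pkg/code.tar"))  # prints b'\x00tarball'
```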

pikulmar avatar Oct 12 '21 18:10 pikulmar

@pikulmar

  • Yes, we can get rid of metaflow.datatools.S3 from the @conda implementation entirely - if you would like to submit a PR please let me know!
  • PRs #580 and #644 have been merged into the Metaflow codebase - I would be happy to take a look at your Azure datastore PR. I am not very familiar with cloudpathlib - is there a specific reason to opt for it in lieu of Azure's Python SDK?

savingoyal avatar Oct 18 '21 18:10 savingoyal

Any update on supporting MSFT Azure?

Dana-Farber avatar Nov 24 '21 15:11 Dana-Farber

I posted an invitation on my social media. Please amplify it to reach potential contributors:

https://www.linkedin.com/posts/webmax_support-for-another-public-cloud-microsoft-activity-6869342352914309120-V4U8

https://twitter.com/webmaxru/status/1463576009781956616?s=20

webmaxru avatar Nov 24 '21 18:11 webmaxru

@savingoyal It appears that we might not require Azure Blob Storage support in Metaflow after all (we might decide to share details on this later), which is why I am not sure how much time we would be able to dedicate to a corresponding PR at this time.

Regarding your question, there is a trade-off between versatility and performance:

  • Using cloudpathlib makes implementing the Datastore API particularly straightforward and has the advantage of supporting all cloud storage services that cloudpathlib supports, including those added in the future.

  • Using cloud-specific libraries like boto3, on the other hand, can provide improved performance by using parallel data transfers etc. (I am not aware of cloudpathlib supporting that yet). But, of course, it requires work for each additional cloud storage service to be supported.
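The parallel-transfer advantage mentioned above can, to a first approximation, be recovered generically with a thread pool over per-object operations. A hedged sketch (fetch_many is a made-up helper; a local directory stands in for a blob container, so the latency benefit is only notional here):

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def fetch_many(root, keys, max_workers=8):
    """Fetch several objects concurrently. Against a network-backed
    store (boto3, azure-storage-blob, or a cloudpathlib CloudPath root),
    the threads overlap per-object latency; against a local Path the
    gain is negligible."""
    def fetch_one(key):
        return key, (root / key).read_bytes()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the result dict does too.
        return dict(pool.map(fetch_one, keys))

# Demo against a temporary directory standing in for a blob container.
root = Path(tempfile.mkdtemp())
for i in range(3):
    (root / f"obj{i}").write_bytes(str(i).encode())
print(fetch_many(root, ["obj0", "obj1", "obj2"]))
# prints {'obj0': b'0', 'obj1': b'1', 'obj2': b'2'}
```

Cloud SDKs additionally parallelize within a single large object (multipart/ranged transfers), which a simple thread pool over whole objects does not replicate.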

Generally, development could be split into multiple PRs:

  1. Refactor metaflow.datatools.S3 (and perhaps other parts of the code) to exclusively rely on the new Datastore abstraction.

  2. a. Implement a cloudpathlib-based Datastore similar to what was linked/discussed above.

    b. Add an Azure-specific, performance optimized Datastore implementation if this is of interest.

Question: Can Datastore implementations also be managed as plug-ins (via the metaflow_extensions mechanism) and, if yes, would such an approach be preferred for contributions 2a and/or 2b?

pikulmar avatar Nov 30 '21 08:11 pikulmar

A few notes:

  • Currently, datastores cannot be added via the metaflow_extensions mechanism, but that should be possible very shortly (it's trivial to do and just requires a tiny code reorg -- it's planned, but I didn't do it yet given all the other changes that were in flight).
  • I would rather keep metaflow.datatools.S3 as-is; currently the S3 storage implementation for the datastore relies on it, so it would be a bit circular if we made it depend on the datastore. Everything else should go through the datastore. It's a good question, though, whether we keep it as part of core Metaflow or move it to a more S3-specific portion.
  • Adding an implementation with cloudpathlib should hopefully be very easy; you just have to implement the storage part, which has just a few methods. The same goes for the Azure-specific one. Everything in the datastore boils down to those functions. There is a requirement to store "metadata" about the file, but that can be kept as a separate blob, as in the local implementation.
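The "few methods plus metadata as a separate blob" shape described above might look like the following sketch (BlobWithMetadata and the .meta.json naming are illustrative assumptions, not Metaflow's actual implementation; a local Path stands in for a cloud path):

```python
import json
import tempfile
from pathlib import Path

class BlobWithMetadata:
    """Store each object as two blobs: <key> for the content and
    <key>.meta.json for its metadata, so a backend without native
    object metadata can still satisfy the datastore contract."""

    def __init__(self, root):
        self.root = root

    def _meta_path(self, key):
        path = self.root / key
        return path.parent / (path.name + ".meta.json")

    def save(self, key, data, metadata=None):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        if metadata is not None:
            self._meta_path(key).write_text(json.dumps(metadata))

    def load(self, key):
        meta_path = self._meta_path(key)
        metadata = json.loads(meta_path.read_text()) if meta_path.exists() else None
        return (self.root / key).read_bytes(), metadata

store = BlobWithMetadata(Path(tempfile.mkdtemp()))
store.save("flow/step/out", b"payload", {"content_type": "application/octet-stream"})
print(store.load("flow/step/out"))
# prints (b'payload', {'content_type': 'application/octet-stream'})
```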

romain-intel avatar Dec 03 '21 18:12 romain-intel

This work is currently in flight. We expect a feature-complete PR to be available over the next couple of weeks. It will cover Azure Blob Storage as the Azure datastore and run on top of Kubernetes (AKS or BYOC).

savingoyal avatar Jul 12 '22 01:07 savingoyal

Wonderful! Would you mind letting me know when this is done and testable? Thx!!!

Dana-Farber avatar Jul 13 '22 06:07 Dana-Farber

@Dana-Farber The PR is now available for testing.

savingoyal avatar Jul 20 '22 18:07 savingoyal