Adding support for Azure Blob Storage as a datastore

Open jackie-ob opened this issue 2 years ago • 9 comments

The primary change is implementing AzureStorage (analogous to the existing S3Storage and LocalStorage). We are consciously deferring the decision on whether to offer first-class "data tools" support for Azure.

There are some necessary changes to ensure full Azure support on all Metaflow surfaces:

  • includefile
  • conda
  • cards
  • mflog
  • kubernetes
  • argo

We take care to ensure there is no cross disruption to users not using Azure. More specifically:

  • Users need to set up AWS dependencies (boto3, config params) iff they are using AWS.
  • Users need to set up Azure dependencies (Azure SDK libs, config params) iff they are using Azure.

We aggressively use local imports to achieve this.
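
The local-import pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not Metaflow's actual code: the function name `load_sdk` and the datastore type strings are assumptions made for the example; only the SDK module names (`boto3`, `azure.storage.blob`) are real.

```python
def load_sdk(ds_type):
    """Return the SDK module for a datastore type (hypothetical helper).

    Each import lives inside its own branch, so a cloud SDK is only
    imported when that datastore is actually selected.
    """
    if ds_type == "local":
        import os  # standard library, always available
        return os
    if ds_type == "s3":
        import boto3  # only needed by AWS users
        return boto3
    if ds_type == "azure":
        import azure.storage.blob as blob_sdk  # only needed by Azure users
        return blob_sdk
    raise ValueError("Unknown datastore type: %s" % ds_type)
```

With this shape, a user who only ever selects `"s3"` never triggers the Azure import, and vice versa.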

Much effort was also spent to ensure good performance of Metaflow's usage of Azure Blob Storage. See context docs for more details.

Some docs for context:

jackie-ob avatar Jul 20 '22 17:07 jackie-ob

Hi there, I'm eager to try this out. I have some basic questions.

  1. Is there a simple "getting started" or "howto" example for setting up Azure support?
  2. Now that there will be different dependencies for different backends, have you considered using install extras offered by setuptools? It would offer a neat way to specify versions of the dependencies. I mean these: https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies but maybe this is handled elsewhere in the program (only reference I found was in pinned conda libs).
  3. metaflow configure does not hint at any Azure options, is this intended for now?

Thanks!

Mikkolehtimaki avatar Jul 24 '22 12:07 Mikkolehtimaki

Thank you for your interest!

  1. How-to guide. We do have some WIP instructions and templates, currently in a private preview state. We will share the Google Docs with you. Please DM us directly in the community Slack!

  2. “Pinned packages”. 

To share some context: even before the Azure project, we already had optional dependencies, though we handle them in a somewhat low-tech way (IMO). We use soft imports, i.e. we only import an optional dependency when the user chooses a feature that requires it. We do this for the Kubernetes integration, and we do it for S3 (boto3). Note that we actually list boto3 as non-optional, but if the user ends up not using S3 we won’t import boto3 at runtime.

We plan to take the same approach for Azure, i.e. we will import the Azure deps iff the user really uses Azure. If those Azure deps are missing at runtime, we take care to show a friendly message asking the user to install them at that point.
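
As a rough sketch of the soft-import-with-friendly-message idea described here: the helper name `soft_import`, the exception class, and the message wording are all hypothetical and are not taken from the Metaflow code base.

```python
import importlib


class MissingDependencyError(Exception):
    """Hypothetical exception type used only in this sketch."""


def soft_import(module_name, install_hint):
    """Import a module on demand, turning ImportError into a readable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Tell the user what is missing and how to fix it,
        # instead of surfacing a bare traceback.
        raise MissingDependencyError(
            "Could not import '%s'. Install it with: %s"
            % (module_name, install_hint)
        ) from e
```

A caller on the Azure code path might then write something like `soft_import("azure.storage.blob", "pip install azure-storage-blob")`, so the check only runs once Azure is actually selected.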



Setuptools’s optional dependencies feature does seem like a promising way to handle dependencies more gracefully. If we do go down this route, we would likely want to handle Kubernetes, S3, Azure (and any other optional features) in a uniform way. We would want to think through what the UX will be like for folks who need 0 extras, 1 extra, or 2 or more extras, and whether the user base at large understands what the “square brackets” mean exactly, etc.
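
For readers unfamiliar with extras, a uniform layout could look roughly like the sketch below. The group names and package lists are assumptions for illustration only; this is not a description of what Metaflow ships.

```python
# Hypothetical extras layout: one group per optional feature.
EXTRAS = {
    "aws": ["boto3"],
    "azure": ["azure-storage-blob", "azure-identity"],
    "kubernetes": ["kubernetes"],
}

# Convenience group that pulls in every optional dependency.
EXTRAS["all"] = sorted({pkg for pkgs in EXTRAS.values() for pkg in pkgs})

# In a setup.py, this dict would be passed to setuptools as:
#   setuptools.setup(..., extras_require=EXTRAS)
```

Under such a layout, the "square brackets" refer to install commands of the form `pip install "somepackage[azure]"`, and several extras can be combined as `"somepackage[aws,kubernetes]"`.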

  3. “metaflow configure azure” was added late last week. Please check it out!

jackie-ob avatar Jul 25 '22 17:07 jackie-ob

Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc

nflx-mf-bot avatar Aug 01 '22 15:08 nflx-mf-bot

Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc had 2 FAILUREs.

nflx-mf-bot avatar Aug 02 '22 07:08 nflx-mf-bot

Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc had 6 FAILUREs.

nflx-mf-bot avatar Aug 02 '22 09:08 nflx-mf-bot

Testing[203] @ 349982f2fa4367e0918477bf4a37371a5b6f423e

nflx-mf-bot avatar Aug 02 '22 17:08 nflx-mf-bot

Testing[203] @ 349982f2fa4367e0918477bf4a37371a5b6f423e had 1 FAILURE.

nflx-mf-bot avatar Aug 02 '22 19:08 nflx-mf-bot

Testing is fine now on my end. The one failure is transient. Small comments remain and we can then go ahead and merge.

romain-intel avatar Aug 03 '22 06:08 romain-intel

NOTE: Not compatible with include file changes. This PR can be made compatible fairly simply by adding the relevant methods in the includefile_support.py file.

romain-intel avatar Aug 17 '22 16:08 romain-intel

I'll re-run the tests as well.

romain-intel avatar Aug 17 '22 16:08 romain-intel

Testing[203] @ 9ae38fecb4b3ca086c35400e7a88e3a0dec91a49

nflx-mf-bot avatar Aug 17 '22 17:08 nflx-mf-bot

Hmm, for some reason there is an issue now (can't find JSONType). I'll take a look in a bit.

romain-intel avatar Aug 18 '22 09:08 romain-intel

Testing[203] @ 7acd60e46c7cfe100bf087304af85628c4380354

nflx-mf-bot avatar Aug 19 '22 16:08 nflx-mf-bot

Testing[203] @ 7acd60e46c7cfe100bf087304af85628c4380354 PASSED

nflx-mf-bot avatar Aug 19 '22 19:08 nflx-mf-bot