metaflow
Adding support for Azure Blob Storage as a datastore
The primary change is implementing AzureStorage (analogous to existing S3Storage, LocalStorage). We are consciously deferring the decision of having first class "data tools" support for Azure.
There are some necessary changes to ensure full Azure support on all Metaflow surfaces:
- includefile
- conda
- cards
- mflog
- kubernetes
- argo
We take care to ensure there is no cross disruption to users not using Azure. More specifically:
- Users need to set up AWS dependencies (boto3, config params) iff they are using AWS.
- Users need to set up Azure dependencies (Azure SDK libs, config params) iff they are using Azure.
We aggressively use local imports to achieve this.
Much effort was also spent to ensure good performance of Metaflow's usage of Azure Blob Storage. See context docs for more details.
Some docs for context:
- Overall Metaflow-on-Azure effort.
- AzureStorage implementation notes (may not match up 100% to code in PR in minor ways, but the bones should be consistent).
- Performance discovery (pre-implementation).
Hi there, I'm eager to try this out. I have some basic questions.
- Is there a simple "getting started" or "howto" example for setting up Azure support?
- Now that there will be different dependencies for different backends, have you considered using install extras offered by setuptools? It would offer a neat way to specify versions of the dependencies. I mean these: https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies but maybe this is handled elsewhere in the program (only reference I found was in pinned conda libs).
- metaflow configure does not hint at any Azure options. Is this intended for now?
Thanks!
Thank you for your interest!
- How-to guide. We do have some WIP instructions and templates, currently in private preview. We will share the Google Docs with you. Please DM us directly in the community Slack!
- “Pinned packages”. To share some context, even prior to the Azure project we already had optional dependencies, though we handle them in a somewhat low-tech way (IMO). We use soft imports, i.e. we import an optional dependency only when the user chooses a feature that requires it. We do this for the Kubernetes integration, and we do it for S3 (boto3). Note that we actually list boto3 as non-optional, but if the user ends up not using S3 we won’t import boto3 at runtime.
We plan to take the same approach for Azure, i.e. we will import Azure deps iff the user really uses Azure. If those Azure deps are missing at runtime, we take care to show a friendly message asking the user to install them at that point.
Setuptools’s optional dependencies feature does seem like a promising option to handle dependencies more gracefully. If we do go down this route, we would likely want to handle Kubernetes, S3, Azure (and any other optional features) in a uniform way. We would want to think through what the UX will be like for folks who need 0 extras, 1 extra, and 2 or more extras, plus whether the user base at large understands what the “square brackets” mean exactly.
- “metaflow configure azure” was added late last week. Please check it out!
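For reference, the setuptools extras idea discussed above could look roughly like the sketch below. This is not something Metaflow ships today. The extra names ("aws", "azure", "kubernetes", "all") and the package groupings are assumptions chosen for illustration, and no version pins are shown.

```python
# Hypothetical extras table for a setup.py.
# The distributions named here exist on PyPI, but the grouping is an assumption.
EXTRAS_REQUIRE = {
    "aws": ["boto3"],
    "azure": ["azure-storage-blob", "azure-identity"],
    "kubernetes": ["kubernetes"],
}

# Convenience extra that pulls in every optional dependency exactly once.
EXTRAS_REQUIRE["all"] = sorted(
    {dep for deps in EXTRAS_REQUIRE.values() for dep in deps}
)

# In setup.py this table would be passed as:
#     setup(..., extras_require=EXTRAS_REQUIRE)
# and a user would opt in with the square-bracket syntax, for example:
#     pip install "metaflow[azure]"
#     pip install "metaflow[aws,kubernetes]"
```

The "all" entry is one way to address the "2 or more extras" case without asking users to list each extra by hand.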
Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc
Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc had 2 FAILUREs.
Testing[203] @ 9576d418f31dc0dacd38986e6ace62f588ad12fc had 6 FAILUREs.
Testing[203] @ 349982f2fa4367e0918477bf4a37371a5b6f423e
Testing[203] @ 349982f2fa4367e0918477bf4a37371a5b6f423e had 1 FAILURE.
Testing is fine now on my end. The one failure is transient. Small comments remain and we can then go ahead and merge.
NOTE: Not compatible with include file changes. This PR can be made compatible fairly simply by adding the relevant methods in the includefile_support.py file.
I'll re-run the tests as well.
Testing[203] @ 9ae38fecb4b3ca086c35400e7a88e3a0dec91a49
Hmm, for some reason there is an issue now (can't find JSONType). I'll take a look in a bit.
Testing[203] @ 7acd60e46c7cfe100bf087304af85628c4380354
Testing[203] @ 7acd60e46c7cfe100bf087304af85628c4380354 PASSED