Add support for plain pip package installation (similar to Conda)
It would be nice to have this when you are not using Conda, the purpose being to track the plain deps as part of the Flow.
I am not sure what you mean. You can currently do something like os.system('pip install ...') as part of your flow, although that is not our recommendation since it does not deal with changes in transitive dependencies. Conda allows you to have a reproducible environment (and the dependencies are versioned with the flow), but you can definitely do without it and use the approach above to install dependencies.
@pombredanne What would be the expected behaviour of such support? Currently, we create isolated environments for every step, and many data science packages have system package dependencies which pip cannot handle. As for installing pip packages within a step, you can follow @romain-intel's advice.
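For reference, a minimal sketch of that ad-hoc approach inside a step (the package name and version below are just placeholders, and this gives no reproducibility guarantees since transitive dependencies are not pinned):

from metaflow import FlowSpec, step

class PipInStepFlow(FlowSpec):

    @step
    def start(self):
        # Install a pip package at runtime, inside the step itself.
        import subprocess
        import sys
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "requests==2.25.1"],
            check=True,
        )
        import requests
        print("requests version:", requests.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PipInStepFlow()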
I can install a pip package with a plain call alright. This ticket is about offering the same features as the Conda plugin https://github.com/Netflix/metaflow/tree/5c047cf6950975e5ea1b69bbc89fa1ff80cfa004/metaflow/plugins/conda for plain pip (possibly assuming, at first, that we are already running in some venv; TBD). Basically, versioning a flow's dependencies, but without Conda, which is popular only in some circles, so as to provide a non-Conda way to get a reproducible execution environment.
@pombredanne, to make sure I understand, you effectively want the functionality of the "conda" plugin (versioning dependencies) but instead of using conda, use plain pip and venv to offer that support. Is this correct?
@romain-intel you wrote:
to make sure I understand, you effectively want the functionality of the "conda" plugin (versioning dependencies) but instead of using conda, use plain pip and venv to offer that support. Is this correct?
exactly... though in earnest that convenience would have to be weighed against packaging a workflow and its deps externally before doing anything there.
And if you assume that you are already running in some venv-like isolated environment, then you can focus only on the pip side, e.g. install a set of frozen/pinned/hashed Python requirements.
(FWIW I routinely automate pip/venv things with scripts such as https://github.com/nexB/scancode-toolkit/blob/develop/etc/configure.py)
See also a POC to declare deps beyond a single package manager at https://gist.github.com/pombredanne/d3585617882f91d9316be5ce5eddf190, though there is a level of dependency complexity that ends up better frozen in a container instead.
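To illustrate the pip-only side described above, a minimal sketch, assuming a hypothetical requirements.lock file of pinned, hash-locked requirements (e.g. produced by pip-compile --generate-hashes) and that the flow already runs inside the target venv:

import subprocess
import sys

def install_pinned_requirements(lockfile="requirements.lock"):
    # Install an exact, hash-verified set of requirements into the
    # interpreter/venv that is running the flow. The lockfile name is
    # a placeholder for whatever pinned requirements file you maintain.
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "--require-hashes", "-r", lockfile],
        check=True,
    )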
Actually, in hindsight, I cannot fathom a single use case on my side where pip packages would be specific to a step, so this does not make sense as a step decorator in my usage. This is more of a flow-level setup thing, which is global rather than per step; therefore I am closing this.
@pombredanne: I'd actually be interested in exactly the same thing. Managing the requirements via pip instead of conda seems like a good idea. I have a couple of Python packages in a private PyPI repo on which my pipeline relies. I would like to add those. Any idea how to do that?
@lgilz I would know more or less how to do it: this is about essentially duplicating the conda feature to support pip + virtualenv, in a way similar to how conda is supported. It should not be hard, but there is real work needed there, likely several days or more. A shortcut would be to support only flow-level packages and leave aside step-level packages.
As a temporary solution, I currently use a helper decorator if people need pip dependencies which they cannot find on Anaconda.
import functools

def pip(libraries):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            import subprocess
            import sys
            # Install each requested library at its pinned version before
            # the decorated step runs.
            for library, version in libraries.items():
                print('Pip Install:', library, version)
                subprocess.run(
                    [sys.executable, '-m', 'pip', 'install', '--quiet',
                     library + '==' + version],
                    check=True,
                )
            return function(*args, **kwargs)
        return wrapper
    return decorator
You can use the decorator like this:
@conda(libraries={'pandas': '0.25.0', 'scikit-learn': '0.22'})
@pip(libraries={'fasttext': '0.9.1'})
@step
def process(self):
    pass
I know it's a hacky approach, and it would be great to see this officially supported, but I wanted to share my current solution. Maybe it is of help to someone.
@philipphager The @conda decorator currently does much more than just setting up the needed conda environment. On the backend, it snapshots the packages (including transitive dependencies) and the appropriate metadata so that the environment can be reproduced anytime in the future. For packages that are not present in the conda universe, your solution works fine, provided that reproducibility of the execution environment is not a concern.
Allowing venv directly, instead of relying on a third-party solution like conda, would help to manage environments and reduce the complexity of any project. With wheels, system dependencies for pip packages would not be a problem.
I was very surprised to see that conda is a requirement in 2020. To be clear, I think it's good that using conda to ship dependencies is an option since it is familiar to a significant fraction of data scientists.
But the mainstream python ecosystem now solves most of the problems conda solves quite well through wheels, i.e. "data science packages have system package dependencies which pip cannot handle" is no longer true, assuming your deployment target is a mainstream architecture that can use manylinux wheels.
The current conda requirement means that, in order to align with my colleagues in software engineering by using pip, I have to maintain package versions in multiple places. These will inevitably get out of sync, which means we're back to the pre-Docker complaints of "it works on my machine". I would love to see an analogous workflow using standard Python tooling (pip, virtualenv, perhaps pip-tools or poetry) that would eliminate this complexity and allow a single source of truth for the environment.
I don't have a ton of suggestions for how this could work in practice, so just throwing out an example: as a user, I love the lightweight experience of using zappa (see How Zappa Makes Packages). That defines the dependencies to be snapshotted (and shipped to AWS) to be exactly what it sees in the local virtualenv (with the option to exclude, e.g. development dependencies such as notebook or pytest). This makes it possible to use standard tools and be confident the environment is the same on all hosts.
I'd be curious if this could be accommodated by piggybacking on conda's ability to create an environment from an environment.yml file, which also supports pip installs. There are still some caveats since the dependency resolutions occur separately (pip runs afterwards), so ideally it would only be used as a fallback when required. Maybe an intermediate YAML could be used as part of the conda decorator's manifest and environment creation.
My bread and butter for Dockerfiles is to use conda env create -f environment.yml with an environment.yml file along the lines of:
name: <env id>
channels:
  - defaults
  - conda-forge
dependencies:
  - awscli
  - numpy
  - pandas>=1.1
  - ...
  - conda-forge::boost-cpp
  - pip:
      - psycopg2-binary
      - ...
FWIW metaflow supporting this isn't a big deal for me personally, mostly just thinking out loud 🙂
In some work environments (like mine), people don't have direct access to the public conda channels and pypi; instead, the workplace provides "controlled" index repos for these public packages/libraries, which are a (much) smaller subset of the public ones. Some packages (in the internal index repos) are available in conda while some are available in pypi, and packages' versions vary a lot, too. So at work I very often find myself needing a version from the pypi repo that is not available in the conda repo, but @conda can't help here. For this specific case, I think something like @pip would be great.
@xujiboy it sounds like you may want to use a conda custom channel as your private repository.
You'd then be able to use it like
CONDA_CHANNELS=<your channel> python flow.py run
Thank you for your suggestion @russellbrooks. It is possible for me to get packages added to our internal conda channel, it just isn't very convenient and takes time. To create and use a custom channel I would need to get the packages "in" anyway, so by then a custom channel wouldn't be needed. For me personally the problem is the mismatch between the internal conda and pypi repos, where the latter is already quite rich with packages suiting my needs.
another 👍 https://gitter.im/metaflow_org/community?at=6081a180b9e6de24d64fb32e
We'd also be very interested in having this feature.
Extending the @conda decorator as suggested in https://github.com/Netflix/metaflow/issues/24#issuecomment-698018767 would be really convenient, as it should eliminate the need to re-package various pip-only dependencies.
See also https://github.com/Netflix/metaflow/issues/395#issuecomment-740302903
Thinking about the issue with pinning versions of transitive dependencies: could a reliable @pip decorator be implemented by generating a Pipfile.lock as an artifact? I'm not sure how feasible this is, but when reproducing historical runs, the task could grab the generated Pipfile.lock and pip install those requirements.
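As a rough illustration of that idea (not an existing Metaflow feature), a step could snapshot the resolved environment as a flow artifact; here pip freeze output stands in for a real Pipfile.lock:

from metaflow import FlowSpec, step
import subprocess
import sys

class LockfileArtifactFlow(FlowSpec):

    @step
    def start(self):
        # Snapshot the fully resolved environment (including transitive
        # dependencies) as an artifact, so a historical run could later be
        # reproduced by feeding this text back to `pip install -r`.
        self.frozen_requirements = subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=True,
        ).stdout
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    LockfileArtifactFlow()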
Personally, my team has a private PyPI server in AWS CodeArtifact where we push libraries with useful utilities for our team. We really want to be able to do the equivalent of aws codeartifact login ... and then have our task install our internal packages from our private PyPI server. Would a workaround for this presently be to somehow publish all of our packages to a custom conda channel in addition to publishing them to our PyPI server? (So basically, we'd just continue to use conda and skip pip altogether.)
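For what it's worth, a rough sketch of that login-then-install flow inside a task, assuming the AWS CLI is available on the task image; the domain, account, repository, and package names are placeholders:

import subprocess
import sys

def install_from_codeartifact(package):
    # Point pip at the private CodeArtifact repository for this session.
    # Domain, owner account, and repository names below are placeholders.
    subprocess.run(
        ["aws", "codeartifact", "login", "--tool", "pip",
         "--domain", "my-domain", "--domain-owner", "123456789012",
         "--repository", "my-internal-pypi"],
        check=True,
    )
    # Then install the internal package as usual.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        check=True,
    )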
https://docs.metaflow.org/scaling/dependencies/libraries#pypi-in-action
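The linked page documents the @pypi decorator; a minimal usage sketch (the package version here is just a placeholder, see the docs for the exact options):

from metaflow import FlowSpec, step, pypi

class PypiFlow(FlowSpec):

    # The pinned version below is only illustrative.
    @pypi(packages={'pandas': '2.1.0'})
    @step
    def start(self):
        import pandas as pd
        print("pandas version:", pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    PypiFlow()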
Thank you so much! This is great!