Add support for plain pip package installation (similar to Conda)

pombredanne opened this issue 5 years ago • 22 comments

It would be nice to have this when you are not using Conda, the purpose being to track the plain deps as part of the Flow.

pombredanne avatar Dec 05 '19 19:12 pombredanne

I am not sure what you mean. You can currently do something like os.system('pip install ...') as part of your flow, although that is not our recommendation since it does not deal with changes in transitive dependencies. Conda gives you a reproducible environment (with the dependencies versioned alongside the flow), but you can certainly do without it and use the approach above to install dependencies.
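
For illustration, a minimal sketch of that approach inside a flow (the flow name, package, and version pin are illustrative; as noted above, transitive dependencies are not pinned, so this is not reproducible the way the conda decorator is):

from metaflow import FlowSpec, step

class PipAtRuntimeFlow(FlowSpec):

    @step
    def start(self):
        import subprocess
        import sys
        # Install a pinned package into the step's interpreter at runtime.
        # Transitive dependencies are re-resolved on every run.
        subprocess.run(
            [sys.executable, '-m', 'pip', 'install', 'requests==2.22.0'],
            check=True,
        )
        import requests  # now importable within this step
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    PipAtRuntimeFlow()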

romain-intel avatar Dec 06 '19 05:12 romain-intel

@pombredanne What would be the expected behaviour of such support? Currently, we create isolated environments for every step, and many data science packages have system package dependencies which pip cannot handle. As for installing pip packages within a step, you can follow @romain-intel's advice.

savingoyal avatar Dec 06 '19 08:12 savingoyal

I can install a pip package in a plain call just fine. This ticket is about offering the same features as the Conda plugin https://github.com/Netflix/metaflow/tree/5c047cf6950975e5ea1b69bbc89fa1ff80cfa004/metaflow/plugins/conda for plain pip (possibly assuming, at least initially, that we are already running in some venv; TBD). Basically: versioning the dependencies with a flow, but without Conda, which is popular only in some circles, so that there is a non-Conda way to get a reproducible execution environment.

pombredanne avatar Dec 06 '19 08:12 pombredanne

@pombredanne, to make sure I understand, you effectively want the functionality of the "conda" plugin (versioning dependencies) but instead of using conda, use plain pip and venv to offer that support. Is this correct?

romain-intel avatar Dec 06 '19 09:12 romain-intel

@romain-intel you wrote:

to make sure I understand, you effectively want the functionality of the "conda" plugin (versioning dependencies) but instead of using conda, use plain pip and venv to offer that support. Is this correct?

exactly... though in earnest that convenience would have to be weighed against packaging a workflow and its deps externally before doing anything there.

And if you assume that you are already running in some venv-like isolated environment, then you can focus only on the pip side, e.g. installing a set of frozen/pinned/hashed Python requirements.
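
For illustration, a minimal sketch of that pip side, assuming an already-activated venv and a hash-pinned requirements file (the file name is illustrative):

import subprocess
import sys

# Install an exact, hash-verified set of requirements into the currently
# active interpreter; pip aborts if any downloaded file does not match
# the hash recorded in the requirements file.
subprocess.run(
    [sys.executable, '-m', 'pip', 'install',
     '--require-hashes', '-r', 'requirements.txt'],
    check=True,
)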

(FWIW I routinely automate pip/venv setup with scripts such as https://github.com/nexB/scancode-toolkit/blob/develop/etc/configure.py)

See also a POC to declare deps beyond a single package manager at https://gist.github.com/pombredanne/d3585617882f91d9316be5ce5eddf190, though there is a level of dependency complexity that is better frozen into a container instead.

pombredanne avatar Dec 06 '19 10:12 pombredanne

Actually, in hindsight, I cannot think of a single use case on my side where pip packages would be specific to a step, so a step decorator does not make sense for my usage. This is more of a flow-level setup concern, global rather than per-step, so I am closing this.

pombredanne avatar Dec 06 '19 14:12 pombredanne

@pombredanne: I'd actually be interested in exactly the same thing. Managing the requirements via pip instead of conda seems like a good idea. I have a couple of Python packages in a private PyPI repo on which my pipeline relies. I would like to add those. Any idea how to do that?

lgilz avatar Dec 20 '19 15:12 lgilz

@lgilz I know more or less how to do it: it is essentially about duplicating the conda feature to support pip + virtualenv, similar to how conda is supported. It should not be hard, but there is real work needed there, likely several days or more.

A shortcut would be to support only flow-level packages and leave aside step-level packages.

pombredanne avatar Dec 20 '19 15:12 pombredanne

As a temporary solution, I currently use a helper decorator when people need pip dependencies that they cannot find on Anaconda.

import functools
import subprocess
import sys

def pip(libraries):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            # Install each pinned library into the current interpreter
            # before running the decorated step.
            for library, version in libraries.items():
                print('Pip Install:', library, version)
                subprocess.run(
                    [sys.executable, '-m', 'pip', 'install', '--quiet',
                     library + '==' + version],
                    check=True,  # surface install failures instead of ignoring them
                )
            return function(*args, **kwargs)

        return wrapper

    return decorator

You can use the decorator like this:

    @conda(libraries={'pandas': '0.25.0', 'scikit-learn': '0.22'})
    @pip(libraries={'fasttext': '0.9.1'})
    @step
    def process(self):
        pass

I know it's a hacky approach and it would be great to see this officially supported, but I wanted to share my current solution. Maybe it helps someone.

philipphager avatar Jan 08 '20 09:01 philipphager

@savingoyal

romain-intel avatar Jan 09 '20 22:01 romain-intel

@philipphager The @conda decorator currently does much more than just setting up the needed conda environment. On the backend, it snapshots the packages (including transitive dependencies) and the appropriate metadata so that the environment can be reproduced at any time in the future. For packages that are not present in the conda universe, your solution works fine if reproducibility of the execution environment is not a concern.

savingoyal avatar Jan 27 '20 18:01 savingoyal

Allowing venv directly, instead of relying on a third-party solution like conda, would help with managing environments and reduce the complexity of any project. With wheels, system dependencies of pip packages would not be a problem.

ispmarin avatar Jul 08 '20 17:07 ispmarin

I was very surprised to see that conda is a requirement in 2020. To be clear, I think it's good that using conda to ship dependencies is an option since it is familiar to a significant fraction of data scientists.

But the mainstream python ecosystem now solves most of the problems conda solves quite well through wheels, i.e. "data science packages have system package dependencies which pip cannot handle" is no longer true, assuming your deployment target is a mainstream architecture that can use manylinux wheels.

The current conda requirement means that, in order to align with my colleagues in software engineering by using pip, I have to maintain package versions in multiple places. These will inevitably get out of sync, which means we're back to the pre-Docker complaint of "it works on my machine". I would love to see an analogous workflow using standard Python tooling (pip, virtualenv, perhaps pip-tools or poetry) that would eliminate this complexity and allow a single source of truth for the environment.

I don't have a ton of suggestions for how this could work in practice, so I'm just throwing out an example: as a user, I love the lightweight experience of using zappa (see How Zappa Makes Packages). It defines the dependencies to be snapshotted (and shipped to AWS) as exactly what it sees in the local virtualenv (with the option to exclude, e.g., development dependencies such as notebook or pytest). This makes it possible to use standard tools and be confident the environment is the same on all hosts.

mikepqr avatar Jul 28 '20 18:07 mikepqr

I'd be curious whether this could be accommodated by piggybacking on conda's ability to create an environment from an environment.yml file, which also supports pip installs. There are still some caveats, since the dependency resolutions happen separately (pip runs afterwards), so ideally pip would only be used as a fallback when required. Maybe an intermediate YAML could be generated as part of the conda decorator's manifest and environment creation.

My bread and butter for Dockerfiles is to use conda env create -f environment.yml with an environment.yml file along the lines of:

name: <env id>
channels:
- defaults
- conda-forge
dependencies:
- awscli
- numpy
- pandas>=1.1
- ...
- conda-forge::boost-cpp
- pip:
  - psycopg2-binary
  - ...

FWIW metaflow supporting this isn't a big deal for me personally, mostly just thinking out loud 🙂

russellbrooks avatar Sep 23 '20 23:09 russellbrooks

In some work environments (like mine), people don't have direct access to the public conda channels and PyPI; instead the workplace provides "controlled" index repos of these public packages/libraries, which are a (much) smaller subset of the public ones. Some packages (in the internal index repos) are available in conda while some are available in PyPI, and package versions vary a lot, too. So at work I very often find myself needing a version from the PyPI repo that is not available in the conda repo, and @conda can't help here. For this specific case, I think something like @pip would be great.

xujiboy avatar Oct 02 '20 22:10 xujiboy

@xujiboy it sounds like you may want to use a conda custom channel as your private repository.

You'd then be able to use it like

CONDA_CHANNELS=<your channel> python flow.py run

russellbrooks avatar Oct 06 '20 17:10 russellbrooks

@xujiboy it sounds like you may want to use a conda custom channel as your private repository.

You'd then be able to use it like

CONDA_CHANNELS=<your channel> python flow.py run

Thank you for your suggestion @russellbrooks. It is possible for me to get packages added to our internal conda channel; it is just not very convenient and takes time. To create and use a custom channel I would need to get the packages "in" anyway, and by then a custom channel wouldn't be needed. For me personally the issue is the mismatch between the internal conda and PyPI repos, where the latter is already quite rich with packages suiting my needs.

xujiboy avatar Oct 11 '20 22:10 xujiboy

another 👍 https://gitter.im/metaflow_org/community?at=6081a180b9e6de24d64fb32e

tuulos avatar Apr 22 '21 16:04 tuulos

We'd also be very interested in having this feature.

cyrillay avatar Oct 12 '21 13:10 cyrillay

Extending the @conda decorator as suggested in https://github.com/Netflix/metaflow/issues/24#issuecomment-698018767 would be really convenient, as it should eliminate the need to re-package various pip-only dependencies.

See also https://github.com/Netflix/metaflow/issues/395#issuecomment-740302903

pikulmar avatar Feb 02 '22 09:02 pikulmar

Thinking about the issue of pinning versions of transitive dependencies:

Could a reliable @pip decorator be implemented by generating a Pipfile.lock as an artifact? I'm not sure how feasible this is, but when reproducing historical runs, the task could grab the generated Pipfile.lock and pip install ... those requirements.
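
A rough sketch of that idea using pip freeze output in place of a Pipfile.lock (the flow and artifact names are illustrative):

from metaflow import FlowSpec, step

class FrozenDepsFlow(FlowSpec):

    @step
    def start(self):
        import subprocess
        import sys
        # Snapshot the fully resolved environment, transitive
        # dependencies included, as a versioned flow artifact.
        self.frozen_requirements = subprocess.check_output(
            [sys.executable, '-m', 'pip', 'freeze'], text=True
        )
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    FrozenDepsFlow()

To reproduce a historical run, a task could write self.frozen_requirements back to a file and pip install -r it to approximate the original environment.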

Personally, my team has a private PyPI server in AWS CodeArtifact where we push libraries with useful utilities for our team. We really want to be able to do the equivalent of aws codeartifact login ... and then have our tasks install our internal packages from our private PyPI server.

Would a workaround for this presently be to somehow publish all of our packages to a custom conda channel in addition to publishing them to our PyPI server? (so basically, we'd just continue to use conda and skip pip altogether)

phitoduck avatar Sep 04 '22 03:09 phitoduck

https://docs.metaflow.org/scaling/dependencies/libraries#pypi-in-action
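
The linked page documents Metaflow's built-in @pypi step decorator, which resolves the requested features of this issue. A minimal usage sketch based on those docs (the package pin is illustrative):

from metaflow import FlowSpec, pypi, step

class PyPIFlow(FlowSpec):

    @pypi(packages={'fasttext': '0.9.2'})
    @step
    def start(self):
        import fasttext  # installed into an isolated, versioned environment
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    PyPIFlow()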

savingoyal avatar Oct 06 '23 14:10 savingoyal

Thank you so much! This is great!

xujiboy avatar Oct 06 '23 15:10 xujiboy