
`metaflow_environment` dependencies can override or conflict with those set by the Batch docker image, breaking user code

Open ryan-williams opened this issue 3 years ago • 5 comments

Pasting the README from runsascoded/mf-pip-issue, where I have some repro files as well:

Metaflow/pip/Batch issue

Metaflow runs pip install awscli … boto3 while setting up task environments in Batch, which can break aiobotocore<2.1.0.

Repro

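Docker image runsascoded/mf-pip-issue-batch (batch.dockerfile) pins recent, mutually compatible versions of botocore and aiobotocore (botocore 1.20.106 and aiobotocore 1.4.2, per the pip list output in the "Simpler example" section).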

Local mode: ✅

They work fine together normally; runsascoded/mf-pip-issue-local (local.dockerfile) runs s3_flow_test.py successfully (in "local" mode):

docker run -it --rm runsascoded/mf-pip-issue-local
# Metaflow 2.4.8 executing S3FlowTest for user:user
# …
# 2022-01-16 21:21:59.162 Done!

Batch mode: ❌

However, with a Metaflow Batch queue configured:

python s3_flow_test.py run --with batch:image=runsascoded/mf-pip-issue-batch

fails with:

AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'

due to a version mismatch (botocore>=1.23.0, aiobotocore<2.1.0).

Version mismatch

botocore removed ClientCreator._register_lazy_block_unknown_fips_pseudo_regions in 1.23.0, and aiobotocore only updated to botocore>=1.23.0 in 2.1.0, so aiobotocore<2.1.0 requires botocore<1.23.0, otherwise reading from S3 via Pandas will raise this error.
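The compatibility rule in the paragraph above can be written down as a small check. This is a minimal sketch that encodes only that one rule; the function names `parse_version` and `is_compatible` are hypothetical, and it assumes plain dotted numeric versions (no pre-release suffixes).

```python
def parse_version(version: str) -> tuple:
    """Turn a plain dotted version such as '1.20.106' into (1, 20, 106)."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(botocore_version: str, aiobotocore_version: str) -> bool:
    """Apply only the rule described above: aiobotocore<2.1.0 needs botocore<1.23.0."""
    if parse_version(aiobotocore_version) < (2, 1, 0):
        return parse_version(botocore_version) < (1, 23, 0)
    # Newer aiobotocore releases carry their own pins, which this sketch does not model.
    return True


print(is_compatible("1.20.106", "1.4.2"))  # True: the pair pinned in the image
print(is_compatible("1.23.37", "1.4.2"))   # False: the pair left behind by pip
```

The two calls use the version pairs that appear in the "Simpler example" output further down.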

Cause

The version mismatch is caused by Metaflow running pip install awscli … boto3 while setting up the task environment (in Batch and I believe k8s). If awscli or boto3 aren't both installed already, it will pick a recent version to install, see that a recent botocore is also required by that version, and update botocore to >=1.23.0 while aiobotocore is still <2.1.0, breaking Pandas→S3 reading.

Simpler example

Here we see pip install awscli break aiobotocore<2.1.0 directly (in the same image as above):

docker run --rm --entrypoint bash runsascoded/mf-pip-issue-batch -c '
  echo "Before \`pip install awscli\`:" && \
  pip list | grep botocore && \
  pip install awscli -qqq && \
  echo -e "----\nAfter \`pip install awscli\`:" && \
  pip list | grep botocore
' 2>/dev/null 
# Before `pip install awscli`:
# aiobotocore        1.4.2     # ✅
# botocore           1.20.106  # ✅
# ----
# After `pip install awscli`:
# aiobotocore        1.4.2     # ✅
# botocore           1.23.37   # ❌

Here, pip install awscli upgraded botocore to a version that's incompatible with the already-installed aiobotocore.
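To spot this kind of silent upgrade in general, the before/after `pip list` output can be diffed. The following is a rough sketch, not part of the repro repo: `parse_pip_list` and `changed_packages` are hypothetical helpers, and the sample text is copied from the output above.

```python
def parse_pip_list(text: str) -> dict:
    """Map package name to version from lines like 'botocore  1.20.106'."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the 'Package Version' header, and the '----' separator row.
        if len(fields) >= 2 and fields[1][0].isdigit():
            versions[fields[0]] = fields[1]
    return versions


def changed_packages(before: str, after: str) -> dict:
    """Return {name: (old, new)} for packages whose version differs."""
    old, new = parse_pip_list(before), parse_pip_list(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


BEFORE = """aiobotocore        1.4.2
botocore           1.20.106"""
AFTER = """aiobotocore        1.4.2
botocore           1.23.37"""

print(changed_packages(BEFORE, AFTER))
# {'botocore': ('1.20.106', '1.23.37')}
```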

Workaround

The simplest workaround I've found is to ensure Metaflow's pip install awscli click requests boto3 command no-ops, by having some version of those libraries already installed in the image. They should also have consistent transitive dependency versions (otherwise pip install will "help" with those as well).
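One way to verify the workaround before submitting to Batch is to run a small check inside the image. This is a sketch under assumptions: `missing_packages` and `REQUIRED` are hypothetical names, and the package list is taken from the pip command quoted above. It only reports whether each package is present; it does not verify that their transitive dependencies are consistent (running `pip check` in the image covers that part).

```python
from importlib import metadata


def missing_packages(names, version_of=metadata.version):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            version_of(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# The four packages named in the pip command quoted above.
REQUIRED = ["awscli", "click", "requests", "boto3"]

if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    if absent:
        print("not installed, so pip would fetch:", ", ".join(absent))
    else:
        print("all four packages are already present")
```

The `version_of` parameter exists only so the lookup can be swapped out when testing the helper outside the image.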

Scratch

These seem like the minimal Metaflow configs to submit to Batch (and reproduce the issue):

{
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:…",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::…",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<bucket>/metaflow",
  "METAFLOW_DATATOOLS_SYSROOT_S3": "s3://<bucket>/data"
}

Docker build commands:

docker build -f batch.dockerfile -t runsascoded/mf-pip-issue-batch .
docker build -f local.dockerfile -t runsascoded/mf-pip-issue-local .

ryan-williams avatar Jan 16 '22 23:01 ryan-williams

@ryan-williams The pip install awscli ... should be a no-op for any of the libraries that are already present in the image.

savingoyal avatar Jan 18 '22 18:01 savingoyal

Yes, but if e.g. awscli isn't already installed, installing it can change the versions of things that are already installed, including breaking them. The "Simpler example" section above illustrates this most directly.

ryan-williams avatar Jan 18 '22 20:01 ryan-williams

To be clear, it's possible for the following to happen:

  • user builds image with valid *boto* versions
  • user sets that image as $METAFLOW_BATCH_CONTAINER_IMAGE, runs a flow --with batch
  • flow fails because boto versions in the step environment are broken:
    • before running the step, Metaflow ran its own pip install in the container
    • that pip install inadvertently changed the versions of things the user had already installed in the image (namely botocore), resulting in other things the user installed (aiobotocore) being broken

I don't know what the solution should be, but this behavior is surprising and undesirable. It was enabled by a breaking change in boto in November, which I suspect will keep washing around the ecosystem for some time, so it's good to be aware of this specific interaction with Metaflow's step-env setup logic.

ryan-williams avatar Jan 19 '22 17:01 ryan-williams

Ran into this again today. Here's an updated link to the offending line, in 2.8.2.

Here's a simple repro:

1. User installs boto/s3fs/pandas, successfully reads CSV from S3

# mf1.dockerfile
FROM python:3.9
WORKDIR /root
RUN pip install \
    boto3==1.24.59 \
    botocore==1.27.59 \
    aiobotocore==2.4.2 \
    s3fs==2023.1.0 \
    pandas
# ✅ works fine, reads publicly-accessible CSV from S3. boto/s3fs/pandas versions are mutually compatible.
ENTRYPOINT [ "python", "-c", "import pandas as pd; print(pd.read_csv('s3://ctbk/csvs/JC-202301-citibike-tripdata.csv'))" ]

Test image:

docker build -tmf1 -fmf1.dockerfile .
docker run --rm -it mf1
# ✅ works fine, prints DataFrame
                ride_id  rideable_type  ...    end_lng member_casual
0      0905B18B365C9D20   classic_bike  ... -74.044247        member
1      B4F0562B05CB5404  electric_bike  ... -74.041664        member
2      5ABF032895F5D87E   classic_bike  ... -74.042521        member
3      E7E1F9C53976D2F9   classic_bike  ... -74.044247        member
4      323165780CA0734B   classic_bike  ... -74.042884        member
...                 ...            ...  ...        ...           ...
56070  17CD2F4ABD4F6785   classic_bike  ... -74.050389        member
56071  D75D12846E6838D0  electric_bike  ... -74.050389        member
56072  36387397177CAA80  electric_bike  ... -74.050389        member
56073  B66278F45420CFA0   classic_bike  ... -74.030305        member
56074  230153A8D1F2D5F7   classic_bike  ... -74.030305        member

[56075 rows x 13 columns]

2. Metaflow runs pip install awscli boto3, breaking aiobotocore/s3fs/pandas

# mf2.dockerfile
FROM mf1
RUN pip install awscli boto3  # 💥 this breaks the user's installs; `pd.read_csv("s3://…")` no longer works

Test image:

docker build -tmf2 -fmf2.dockerfile .
docker run --rm -it mf2
# pd.read_csv raises PermissionError: Forbidden
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 112, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiobotocore/client.py", line 358, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/common.py", line 716, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/common.py", line 425, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/usr/local/lib/python3.9/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/usr/local/lib/python3.9/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1135, in open
    f = self._open(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 649, in _open
    return S3File(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 2024, in __init__
    super().__init__(
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1491, in __init__
    self.size = self.details["size"]
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1504, in details
    self._details = self.fs.info(self.path)
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 114, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 99, in sync
    raise return_result
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 54, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 1238, in _info
    out = await self._call_s3(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 339, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 139, in _error_wrapper
    raise err
PermissionError: Forbidden

pip install awscli boto3 explicitly logs an ERROR about breaking aiobotocore:

docker run --rm -it --entrypoint pip mf1 install awscli boto3
# …
# ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
# aiobotocore 2.4.2 requires botocore<1.27.60,>=1.27.59, but you have botocore 1.29.110 which is incompatible.
# Successfully installed PyYAML-5.4.1 awscli-1.27.110 boto3-1.26.110 botocore-1.29.110 colorama-0.4.4 docutils-0.16 pyasn1-0.4.8 rsa-4.7.2
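The conflict pip reports here can be reproduced by checking the installed version against the requirement string from the log. This is a simplified sketch: `satisfies` is a hypothetical helper that handles only `<`, `<=`, `>`, `>=`, and `==` on plain dotted numeric versions with the same number of components. The `packaging` library implements the full version-specifier rules and is the better choice for real use.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq,
       "<": operator.lt, ">": operator.gt}


def parse_version(version: str) -> tuple:
    """Turn a plain dotted version such as '1.27.59' into (1, 27, 59)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed: str, specifier: str) -> bool:
    """Check a version against comma-separated clauses like '<1.27.60,>=1.27.59'."""
    have = parse_version(installed)
    for clause in specifier.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in ("<=", ">=", "==", "<", ">"):
            if clause.startswith(symbol):
                if not OPS[symbol](have, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Requirement string and versions taken from the pip log above.
print(satisfies("1.27.59", "<1.27.60,>=1.27.59"))   # True: version pinned in mf1
print(satisfies("1.29.110", "<1.27.60,>=1.27.59"))  # False: version pip installed
```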

The simplest workaround remains to make sure awscli and boto3 are both installed in any image you pass to Metaflow Batch mode, but Metaflow could/should do something more careful/correct here.

ryan-williams avatar Apr 10 '23 20:04 ryan-williams