pandera
PyArrow as optional dependency
Is your feature request related to a problem? Please describe.
The PyArrow package contains some very large libraries (e.g., libarrow.so at 50 MB and libarrow_flight.so at 14 MB). This makes it very hard to use Pandera in a serverless environment, since deployment packages have strict size limits and PyArrow is currently a required dependency. In practice, Pandera is unusable on AWS Lambda.
Describe the solution you'd like
It seems that PyArrow is not really part of Pandera's core. I would therefore like to suggest making pyarrow an optional dependency, so that Pandera can be used in environments with strict size constraints.
Describe alternatives you've considered
Not applicable.
Additional context
Not applicable.
this use case makes sense... I agree pyarrow should be an optional dependency (not a package extra).
To remove it as a dependency of the project, we need to:
- remove it from the environment.yml file
- run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
- remove it from the setup.py file
Additionally, we should explicitly raise a TypeError when users try to use the pyarrow-backed pandas string dtype string[pyarrow], which pandera translates to a pandera DataType here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L464-L483. We can probably raise the error in the __post_init__ method if storage == "pyarrow".
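A rough sketch of what that guard could look like. Note this is a standalone illustration, not pandera's actual code: the `STRING` class and `pyarrow_installed` helper here are simplified stand-ins for the real DataType wrapper in pandas_engine.py.

```python
import dataclasses
from typing import Optional


def pyarrow_installed() -> bool:
    """Check whether pyarrow is importable without requiring it."""
    try:
        import pyarrow  # noqa: F401
        return True
    except ImportError:
        return False


@dataclasses.dataclass
class STRING:
    """Simplified stand-in for pandera's string DataType."""

    storage: Optional[str] = "python"

    def __post_init__(self) -> None:
        # Fail early, with an actionable message, if pyarrow-backed
        # storage is requested but pyarrow is not installed.
        if self.storage == "pyarrow" and not pyarrow_installed():
            raise TypeError(
                "storage='pyarrow' requires the pyarrow package to be "
                "installed: pip install pyarrow"
            )
```

The point of raising in `__post_init__` is that users get a clear error at schema-definition time instead of an obscure ImportError deep inside pandas.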
@jeffzi anything I might have missed here?
@markkvdb are you open to making a PR for this change?
Hi @cosmicBboy, I am willing to invest some of my time to make the PR, as I would really love to have the possibility to use Pandera in AWS Lambda. I tried making the changes you suggested, but they break 2 tests.
- Test 1:
```
tests/pyspark/test_schemas_on_pyspark.py:7: in <module>
    import pyspark.pandas as ps
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pyspark/pandas/__init__.py:32: in <module>
    require_minimum_pyarrow_version()
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:60: in require_minimum_pyarrow_version
    raise ImportError(
E   ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
- Test 2:
```
tests/core/test_dtypes.py:118: in <module>
    {pd.StringDtype(storage="pyarrow"): "string[pyarrow]"}
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pandas/core/arrays/string_.py:108: in __init__
    raise ImportError(
E   ImportError: pyarrow>=1.0.0 is required for PyArrow backed StringArray.
```
PyArrow is needed for PySpark; any tips on how I can make this work?
Sorry I missed the ping @cosmicBboy.
@JonathanBonnaud Thanks for looking into it!
- Test 1
> remove it from the environment.yml file, run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
@cosmicBboy I think we need to keep pyarrow as a dev dependency for the pandas tests, but we can indeed remove it from setup.py for user-facing dependencies. That should fix the failing pandas test @JonathanBonnaud
- Test 2
We can depend on pyspark[pandas_on_spark] instead of just pyspark. That will guarantee we have what we need for the pyspark tests:
https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/python/setup.py#L271-L274
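Concretely, the suggestion amounts to a one-line change in the dev requirements. A hypothetical excerpt (the exact file layout in pandera may differ):

```
# requirements-dev.txt (hypothetical excerpt): the pandas_on_spark extra
# pulls in pyarrow and pandas transitively, for the pyspark test suite only,
# without making pyarrow a user-facing dependency in setup.py.
pyspark[pandas_on_spark]
```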
@markkvdb Until the PR lands and a release includes it, one workaround is to leverage AWS Lambda layers. Pandas, numpy, scipy, pyarrow, and awswrangler are all heavy dependencies that are ideally shared among regular AWS Lambda functions as layers. Depending on your deployment tool, you can ignore dependencies that are already provided by a layer. For example, Serverless plus its python plugin can ignore specific dependencies when deploying the zipped function. awswrangler offers a ready-to-go layer with pyarrow and pandas included.
@jeffzi I didn't know about Serverless ignoring dependencies already provided by layers. That's good to know! That said, the Serverless python plugin is far from perfect. For example, you are supposed to be able to specify certain dependencies not to be deployed, but when combined with Poetry this setting is (silently) ignored.
Another important issue is that layers do count toward the total size of the deployment package, so splitting packages across layers will not help with the size limit.
Unfortunately, Serverless does not automatically ignore layer dependencies. You have to list them in the python plugin config.
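For reference, the serverless-python-requirements plugin exposes a noDeploy list for exactly this purpose. A sketch (the package names here are illustrative; list whatever your layer already provides):

```yaml
custom:
  pythonRequirements:
    # Packages already provided by an attached Lambda layer;
    # noDeploy keeps them out of the function's own zip archive.
    noDeploy:
      - pyarrow
      - pandas
      - numpy
```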
> you are supposed to specify certain dependencies not to be deployed but if it's combined with Poetry this setting is (silently) ignored.
I also use Poetry and I agree the plugin is inconsistent. In some cases, I had to resort to slimPatterns to force-ignore deps:
```yaml
pythonRequirements:
  useStaticCache: false
  useDownloadCache: false
  zip: false
  slim: true
  slimPatterns:
    - "botocore*"
    - "cache*"
    - "lxml*"
```
> layers do actually count for the total size of the deployment package. So splitting packages by layer will not help with that.
Forgot about that :( At work I have a dedicated layer for pyarrow that strips the *.so files you mentioned. It reduces the size, but it's obviously better to remove pyarrow as a dependency of pandera.