pandera
PyArrow as optional dependency
Is your feature request related to a problem? Please describe.
The PyArrow package contains some very large libraries (e.g., libarrow.so at 50 MB and libarrow_flight.so at 14 MB). This makes it very hard to use Pandera in a serverless environment, since deployment packages have strict size limits and PyArrow is currently a required dependency. In practice, Pandera is unusable on AWS Lambda.
Describe the solution you'd like
It seems that PyArrow is not really part of Pandera's core. I would therefore like to suggest making pyarrow an optional dependency, so that Pandera can be used in environments with strict size constraints.
Describe alternatives you've considered
Not applicable.
Additional context
Not applicable.
this use case makes sense... I agree pyarrow should be an optional dependency (not a package extra).
To remove it as a dependency of the project, we need to:
- remove it from the environment.yml file
- run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
- remove it from the setup.py file
Additionally, we should explicitly raise a TypeError when users try to use the pyarrow-backed pandas string dtype string[pyarrow], which pandera translates to a pandera DataType here: https://github.com/pandera-dev/pandera/blob/master/pandera/engines/pandas_engine.py#L464-L483. We can probably raise the error in the __post_init__ method if storage == "pyarrow".
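A rough sketch of what that guard could look like. Note this is a standalone illustration, not pandera's actual code: the `STRING` class and `pyarrow_installed` helper here are simplified stand-ins for the real DataType wrapper in pandas_engine.py.

```python
import dataclasses
from typing import Optional


def pyarrow_installed() -> bool:
    """Check whether pyarrow is importable without requiring it."""
    try:
        import pyarrow  # noqa: F401
        return True
    except ImportError:
        return False


@dataclasses.dataclass
class STRING:
    """Simplified stand-in for pandera's string DataType."""

    storage: Optional[str] = "python"

    def __post_init__(self) -> None:
        # Fail early, with an actionable message, if pyarrow-backed
        # storage is requested but pyarrow is not installed.
        if self.storage == "pyarrow" and not pyarrow_installed():
            raise TypeError(
                "storage='pyarrow' requires the pyarrow package to be "
                "installed: pip install pyarrow"
            )
```

The point of raising in `__post_init__` is that users get a clear error at schema-definition time instead of an obscure ImportError deep inside pandas.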
@jeffzi anything I might have missed here?
@markkvdb are you open to making a PR for this change?
Hi @cosmicBboy, I am willing to invest some of my time to make the PR, as I would really love to have the possibility to use Pandera in AWS Lambda. I tried making the changes you suggested, but they break 2 tests.
- Test 1:
```
tests/pyspark/test_schemas_on_pyspark.py:7: in <module>
    import pyspark.pandas as ps
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pyspark/pandas/__init__.py:32: in <module>
    require_minimum_pyarrow_version()
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:60: in require_minimum_pyarrow_version
    raise ImportError(
E   ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
- Test 2:
```
tests/core/test_dtypes.py:118: in <module>
    {pd.StringDtype(storage="pyarrow"): "string[pyarrow]"}
../../.asdf/installs/python/3.8.13/lib/python3.8/site-packages/pandas/core/arrays/string_.py:108: in __init__
    raise ImportError(
E   ImportError: pyarrow>=1.0.0 is required for PyArrow backed StringArray.
```
PyArrow is needed for PySpark; any tips on how I can make this work?
Sorry I missed the ping @cosmicBboy.
@JonathanBonnaud Thanks for looking into it!
- Test 1
> remove it from the environment.yml file, run python scripts/generate_pip_deps_from_conda.py to sync the change to requirements-dev.txt
@cosmicBboy I think we need to keep pyarrow as a dev dependency for the pandas tests, but we can indeed remove it from setup.py for user-facing dependencies. That should fix the failing pandas test @JonathanBonnaud
- Test 2
We can depend on pyspark[pandas_on_spark] instead of just pyspark. That will guarantee we have what we need for the pyspark tests:
https://github.com/apache/spark/blob/899f6c90eb2de5b46a36710a131d7417010ce4b3/python/setup.py#L271-L274
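Concretely, the suggestion amounts to a one-line change in the dev requirements. A hypothetical excerpt (the exact file layout in pandera may differ):

```
# requirements-dev.txt (hypothetical excerpt): the pandas_on_spark extra
# pulls in pyarrow and pandas transitively, for the pyspark test suite only,
# without making pyarrow a user-facing dependency in setup.py.
pyspark[pandas_on_spark]
```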
@markkvdb Until the PR lands and a release includes it, one workaround is to leverage AWS Lambda layers. Pandas, numpy, scipy, pyarrow, and awswrangler are all heavy dependencies that are ideally shared among regular AWS Lambda functions as layers. Depending on your deployment tool, you can ignore dependencies that are already provided by a layer. For example, Serverless plus its python plugin can ignore specific dependencies when deploying the zipped function. awswrangler offers a ready-to-go layer with pyarrow and pandas included.
@jeffzi I didn't know about Serverless ignoring dependencies already provided by layers. That's good to know! That said, the Serverless python plugin is far from perfect. For example, you are supposed to be able to specify certain dependencies not to be deployed, but when combined with Poetry this setting is (silently) ignored.
Another important issue is that layers do count toward the total size of the deployment package, so splitting packages across layers will not help with the size limit.
Unfortunately, Serverless does not automatically ignore layer dependencies. You have to list them in the python plugin config.
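For reference, the serverless-python-requirements plugin exposes a noDeploy list for exactly this purpose. A sketch (the package names here are illustrative; list whatever your layer already provides):

```yaml
custom:
  pythonRequirements:
    # Packages already provided by an attached Lambda layer;
    # noDeploy keeps them out of the function's own zip archive.
    noDeploy:
      - pyarrow
      - pandas
      - numpy
```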
> you are supposed to specify certain dependencies not to be deployed but if it's combined with Poetry this setting is (silently) ignored.
I also use Poetry and I agree the plugin is inconsistent. In some cases, I had to resort to slimPatterns to force-ignore deps:
```yaml
pythonRequirements:
  useStaticCache: false
  useDownloadCache: false
  zip: false
  slim: true
  slimPatterns:
    - "botocore*"
    - "cache*"
    - "lxml*"
```
> layers do actually count for the total size of the deployment package. So splitting packages by layer will not help with that.
Forgot about that :( At work I have a dedicated layer for pyarrow that strips the *.so files you mentioned. It reduces the size, but it's obviously better to remove pyarrow as a dependency of pandera.