fastparquet
Install numpy-1.20.0rc1 causing errors
What happened:
Package versions before 07.12.2020
fastparquet-0.4.1 llvmlite-0.34.0 numba-0.51.2 numpy-1.19.2 packaging-20.4 pandas-1.1.3 pyparsing-2.4.7 python-dateutil-2.8.1 python-snappy-0.5.4 pytz-2020.1 thrift-0.13.0
Since 07.12.2020 I have been getting a fastparquet error on Python 3.6:
Collecting fastparquet==0.4.1
  Downloading fastparquet-0.4.1.tar.gz (28.6 MB)
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-7obzjc0l/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-7obzjc0l/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-b0abneeh
         cwd: /tmp/pip-install-7obzjc0l/fastparquet/
    Complete output (68 lines):
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
        yield saved
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
        yield
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
        _execfile(setup_script, ns)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
        exec(code, globals, locals)
      File "/tmp/easy_install-ndh2xtme/numpy-1.20.0rc1/setup.py", line 30, in <module>
        extra = {}
    RuntimeError: Python version >= 3.7 required.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-7obzjc0l/fastparquet/setup.py", line 98, in <module>
        **extra
      File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 128, in setup
        _install_setup_requires(attrs)
      File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 123, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 513, in fetch_build_eggs
        replace_conflicting=True,
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 774, in resolve
        replace_conflicting=replace_conflicting
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1057, in best_match
        return self.obtain(req, installer)
      File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1069, in obtain
        return installer(requirement)
      File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 580, in fetch_build_egg
        return cmd.easy_install(req)
      File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 698, in easy_install
        return self.install_item(spec, dist.location, tmpdir, deps)
      File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 724, in install_item
        dists = self.install_eggs(spec, download, tmpdir)
      File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 909, in install_eggs
        return self.build_and_install(setup_script, setup_base)
      File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1177, in build_and_install
        self.run_setup(setup_script, setup_base, args)
      File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1163, in run_setup
        run_setup(setup_script, args)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 253, in run_setup
        raise
      File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
        self.gen.throw(type, value, traceback)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
        yield
      File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
        self.gen.throw(type, value, traceback)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 166, in save_modules
        saved_exc.resume()
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 141, in resume
        six.reraise(type, exc, self._tb)
      File "/usr/lib/python3/dist-packages/setuptools/_vendor/six.py", line 685, in reraise
        raise value.with_traceback(tb)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
        yield saved
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
        yield
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
        _execfile(setup_script, ns)
      File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
        exec(code, globals, locals)
      File "/tmp/easy_install-ndh2xtme/numpy-1.20.0rc1/setup.py", line 30, in <module>
        extra = {}
    RuntimeError: Python version >= 3.7 required.
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
WARNING: You are using pip version 20.2.4; however, version 20.3.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
ERROR: Service 'aviation-pipelines-service' failed to build : The command '/bin/sh -c pip3 install -r requirements.python.txt' returned a non-zero code: 1
It had been working fine for 2 months, with everything installed in Docker. So I upgraded Python to 3.7.5, as is now required (why?).
It is strange that fastparquet tries to install an RC version of numpy:
File "/tmp/easy_install-ndh2xtme/numpy-1.20.0rc1/setup.py", line 30, in <module>
Now running script:
import sys, getopt
import pandas as pd
import warnings

def main(argv):
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv,"hi:o:",["file=", "ifile=","ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    df = pd.read_parquet(inputfile, engine='fastparquet')
    df.to_csv(outputfile)
    print('Done')

if __name__ == "__main__":
    main(sys.argv[1:])
I get this error:
Traceback (most recent call last):
  File "/home/node/app/src/core/parquet/convert-to-csv.py", line 29, in <module>
    main(sys.argv[1:])
  File "/home/node/app/src/core/parquet/convert-to-csv.py", line 23, in main
    df = pd.read_parquet(inputfile, engine='fastparquet')
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/parquet.py", line 316, in read_parquet
    impl = get_engine(engine)
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/parquet.py", line 44, in get_engine
    return FastParquetImpl()
  File "/usr/local/lib/python3.7/dist-packages/pandas/io/parquet.py", line 155, in __init__
    "fastparquet", extra="fastparquet is required for parquet support."
  File "/usr/local/lib/python3.7/dist-packages/pandas/compat/_optional.py", line 107, in import_optional_dependency
    module = importlib.import_module(name)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/fastparquet/__init__.py", line 5, in <module>
    from .core import read_thrift
  File "/usr/local/lib/python3.7/dist-packages/fastparquet/core.py", line 9, in <module>
    from . import encoding
  File "/usr/local/lib/python3.7/dist-packages/fastparquet/encoding.py", line 19, in <module>
    from .speedups import unpack_byte_array
  File "__init__.pxd", line 242, in init fastparquet.speedups
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
Environment:
- Python version: from 3.6 to 3.8
- Docker Image on Ubuntu 18.04:
- Pip: 20.2.3
Interestingly, this Dockerfile:
FROM ubuntu:18.04 AS development
USER root:root
RUN \
apt-get update && apt-get install -y curl make gcc g++ cmake python3.6 python3.6-dev python3-pip gnupg libsnappy-dev
COPY --chown=root:root requirements.python.txt ./
RUN pip3 install --upgrade pip==20.2.3
RUN pip3 --version
RUN pip3 install -r requirements.python.txt
returns the first error from the description above,
and when I install each library separately it works fine:
FROM ubuntu:18.04 AS development
USER root:root
RUN \
apt-get update && apt-get install -y curl make gcc g++ cmake python3.6 python3.6-dev python3-pip gnupg libsnappy-dev
COPY --chown=root:root requirements.python.txt ./
RUN pip3 install --upgrade pip==20.2.3
RUN pip3 --version
RUN pip3 install numpy==1.18.0
RUN pip3 install pandas==1.1.3
RUN pip3 install fastparquet==0.4.1
RUN pip3 install python-snappy==0.5.4
No errors at all; everything works fine.
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
This is the critical thing, I think, and it has shown up before. When this happens during import, it's just a warning. There is probably something about it on the numpy tracker.
What actually happens during pip install depends on your situation. The binary wheel is built against a specific version of numpy; but if you build from source you will build against the currently installed numpy, after either recreating the C code with cython or not.
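To tell which of these situations you are in, a small diagnostic can help. The following is only a minimal sketch: the helper name `check_import` is hypothetical, and it classifies the failure purely by matching the error text quoted earlier in this thread.

```python
import importlib


def check_import(module_name):
    """Try to import a module and classify the outcome.

    Returns "ok", "missing", or "abi-mismatch". The last case matches the
    "numpy.ndarray size changed" ValueError text quoted in this thread.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well.
        return "missing"
    except ValueError as exc:
        message = str(exc)
        if "size changed" in message and "binary incompatibility" in message:
            return "abi-mismatch"
        raise
    return "ok"


if __name__ == "__main__":
    for name in ("numpy", "fastparquet"):
        print(name, "->", check_import(name))
```

If fastparquet reports "abi-mismatch" while numpy reports "ok", the compiled extension was most likely built against a different numpy than the one currently installed.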
I don't know why pip would be picking the RC numpy...
Note: I generally install using conda to avoid such problems.
When this happens during import, it's just a warning
Hm, maybe not - but there are similar warnings around about the size of dtype.
It's not a warning but an exception. We also see it in different projects.
All these projects are using Dockerfiles containing a pip install -r requirements.txt.
RUN \
apt-get update && apt-get install -y curl make gcc g++ cmake python3.6 python3.6-dev python3-pip gnupg libsnappy-dev
COPY --chown=root:root requirements.python.txt ./
RUN pip3 install --upgrade pip==20.2.3
RUN pip3 --version
RUN pip3 install numpy==1.18.0
RUN pip3 install pandas==1.1.3
RUN pip3 install fastparquet==0.4.1
RUN pip3 install python-snappy==0.5.4
No errors at all; everything works fine.
I am able to reproduce the problem using a Dockerfile. The dockerfile contains a RUN pip install -t <custom location> -r requirements.txt.
To reproduce, I added a Python file test.py and this line to the Dockerfile: RUN python test.py.
test.py:
#!/usr/bin/env python
import fastparquet
Then, to verify the workaround, I inserted these new lines in the Dockerfile:
RUN pip install numpy==1.18.0
RUN pip install fastparquet==0.4.1
RUN pip install python-snappy==0.5.4
Note that these install into the system global site-packages location, i.e. /usr/local/lib/python3.7/lib/site-packages or something similar; not to the custom app location. Now it works. The sequence (numpy before fastparquet) and the location (system global instead of app directory) matter.
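Since the install location matters, it can help to check where each package would actually be imported from. This is a minimal sketch using only the standard library; `module_origin` is a hypothetical helper name.

```python
import importlib.util


def module_origin(module_name):
    """Return the file path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Compare the directories: a numpy in the app directory and a
    # fastparquet built against a different system-wide numpy would show up here.
    for name in ("numpy", "fastparquet"):
        print(name, "->", module_origin(name))
```

Running this inside the container, with the same PYTHONPATH the application uses, shows whether numpy resolves to the custom location or to the system site-packages.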
If you have a suggestion for the right incantation for requirements.txt, please comment in #538.
Based on the intent of https://github.com/dask/fastparquet/blob/master/setup.py#L25-L29, this block looks problematic: https://github.com/dask/fastparquet/blob/master/setup.py#L74-L76 (it includes numpy no matter what command is being run). The StackOverflow link in the block has a less-upvoted answer that links to how scipy includes numpy: https://github.com/scipy/scipy/blob/master/setup.py#L566. Seems like they have a more robust solution (probably don't need all of it) that can be used here?
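One way to read that suggestion is to request numpy at setup time only when the command actually builds something. The sketch below illustrates the idea; `needs_numpy` and the `INFO_COMMANDS` list are assumptions made for illustration, loosely modeled on the approach in scipy's setup.py, and are not fastparquet's actual code.

```python
import sys

# Commands that only query metadata and should not need numpy.
# This list is an assumption for illustration, not an exhaustive one.
INFO_COMMANDS = {"--help", "-h", "--version", "--name", "egg_info", "clean"}


def needs_numpy(argv):
    """Return True if the setup.py invocation is expected to build extensions."""
    args = argv[1:]
    if not args:
        return False
    return not any(arg in INFO_COMMANDS for arg in args)


# In a setup.py, this value would be passed to setup(setup_requires=...).
setup_requires = ["numpy"] if needs_numpy(sys.argv) else []
```

With a guard like this, `python setup.py egg_info` (the command that failed in the original report) would no longer trigger a numpy download through setuptools.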
Seems like they have a more robust solution (probably don't need all of it) that can be used here?
Willing to try it! Do you want to put in a PR? I think so long as the CI build does a python setup.py or pip install, that should be test enough.
Given that we encountered this issue on the job and have a workaround (install numpy first), it's unlikely I'll get to this within the workweek; mostly wanted to offer a solution for anyone interested in fixing this properly. That said, I may have time over the weekend to work on this, but no promises!
A couple of extra questions:
- does this problem happen on py37-39?
- how about if you don't upgrade the version of pip?
The following works... Note that fastparquet 0.4.1 is not supposed to work on py36 any more, so you should go back in versions to find one that does. I don't know if that would also fix the numpy version problem
FROM ubuntu:18.04 AS development
USER root:root
RUN \
apt-get update && apt-get install -y curl make gcc g++ cmake python3.7 python3.7-dev python3-pip gnupg libsnappy-dev
RUN python3.7 -m pip install --upgrade pip==20.2.3
RUN python3.7 -m pip --version
RUN python3.7 -m pip install fastparquet
If that's enough of a solution, please close this; but in any case, I won't hold v0.5.0 over this.
Another solution seems to be to disable installing fastparquet as a binary package by using the --no-binary flag documented here. In requirements.txt, the line can be: fastparquet --no-binary=fastparquet as per this StackOverflow post.
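If the install is driven from a Python script rather than requirements.txt, the same flag can be passed on the pip command line. This is a minimal sketch; `pip_install_from_source_cmd` is a hypothetical helper, and it only builds the command instead of running it.

```python
import sys


def pip_install_from_source_cmd(package):
    """Build a pip command that refuses binary wheels for one package."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-binary={}".format(package),
        package,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.check_call to run it.
    print(" ".join(pip_install_from_source_cmd("fastparquet")))
```

As noted earlier in the thread, a source build compiles against whichever numpy is installed at that moment, so numpy should already be present before this runs.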
Not to piggyback too much on an old issue, but there is a new warning that comes with this numpy.
../../../../../usr/share/miniconda3/envs/test-environment/lib/python3.8/site-packages/fastparquet/writer.py:70
/usr/share/miniconda3/envs/test-environment/lib/python3.8/site-packages/fastparquet/writer.py:70: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. Use `bool` by itself, which is identical in behavior, to silence this warning. If you specifically wanted the numpy scalar type, use `np.bool_` here.
pd.BooleanDtype(): np.bool
Fix: https://github.com/dask/fastparquet/pull/551
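For anyone hitting the same warning in their own code, the replacement the message asks for looks roughly like this. It is a sketch that assumes numpy is installed, and it is not a copy of the change in the linked PR.

```python
import numpy as np

# np.bool was only an alias for the builtin bool and is deprecated in numpy 1.20.
# Either of these spellings avoids the DeprecationWarning:
dtype_from_builtin = np.dtype(bool)
dtype_from_scalar = np.dtype(np.bool_)

# Both resolve to the same numpy boolean dtype.
same = dtype_from_builtin == dtype_from_scalar
```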
FWIW fastparquet.0.6.0post1 doesn't work at all due to this bug, whereas fastparquet.0.5.0 used to work.
numpy 1.19.4.
Would it be best to not publish a binary wheel and let the user build it for their local version of numpy?