adlfs

Read fails right after write of pandas parquet file to Azure

Open PeterFogh opened this issue 3 years ago • 0 comments

When I run the code below with the conda environment listed at the bottom (sorry, I cannot attach the YAML as a file), I get the exception described below. The first read (right after the write) fails after ~1 min with "HttpResponseError: Server encountered an internal error. Please try again after some time.", but subsequent reads succeed. The cause is unknown; sleeping for 2 minutes before reading the file does not solve the problem. I therefore suspect unclosed file pointers, which are only closed once the exception is raised.

What happened: I get the exception:

Traceback (most recent call last):
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/storage/blob/aio/_list_blobs_helper.py", line 70, in _get_next_cb
    return await self._command(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/storage/blob/_generated/aio/operations_async/_container_operations_async.py", line 1329, in list_blob_hierarchy_segment
    raise models.StorageErrorException(response, self._deserialize)
azure.storage.blob._generated.models._models_py3.StorageErrorException: Operation returned an invalid status 'Server encountered an internal error. Please try again after some time.'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 26, in <module>
    df = pd.read_parquet(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/pandas/io/parquet.py", line 312, in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fastparquet/api.py", line 110, in __init__
    with open_with(fn2, 'rb') as f:
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/pandas/io/parquet.py", line 303, in <lambda>
    parquet_kwargs["open_with"] = lambda path, _: fsspec.open(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/core.py", line 134, in open
    out = self.__enter__()
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/spec.py", line 930, in open
    f = self._open(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/adlfs/spec.py", line 1424, in _open
    return AzureBlobFile(
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/adlfs/spec.py", line 1528, in __init__
    self.details = self.fs.info(self.path)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/adlfs/spec.py", line 524, in info
    return maybe_sync(self._info, self, path)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/adlfs/spec.py", line 545, in _info
    out = await self._ls(path, **kwargs)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/adlfs/spec.py", line 721, in _ls
    async for next_blob in blobs:
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/core/async_paging.py", line 154, in __anext__
    return await self.__anext__()
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/core/async_paging.py", line 157, in __anext__
    self._page = await self._page_iterator.__anext__()
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/core/async_paging.py", line 99, in __anext__
    self._response = await self._get_next(self.continuation_token)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/storage/blob/aio/_list_blobs_helper.py", line 77, in _get_next_cb
    process_storage_error(error)
  File "/home/fogh/miniconda3/envs/py38_ADLS_POC/lib/python3.8/site-packages/azure/storage/blob/_shared/response_handlers.py", line 147, in process_storage_error
    raise error
azure.core.exceptions.HttpResponseError: Server encountered an internal error. Please try again after some time.
RequestId:5b49de07-d01e-0129-05b8-4b8c02000000
Time:2021-05-18T07:36:15.2263076Z
ErrorCode:InternalError
Error:None
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f5825aee820>
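
The traceback ends in `fs.info` calling `_ls`, so the failure seems to happen while listing the blob rather than while reading its bytes. A minimal sketch for timing each step separately, to see which call fails and after how long, could look like this. `probe` is a hypothetical helper written for this issue, not part of adlfs, fsspec or pandas, and the commented usage assumes the variables from the example further down.

```python
import time


def probe(label, func):
    """Run func(), print how long it took and whether it raised.

    Hypothetical diagnostic helper (not a library function).
    Returns (ok, elapsed_seconds, result_or_exception).
    """
    start = time.perf_counter()
    try:
        result = func()
        ok = True
    except Exception as exc:  # keep the exception so it can be inspected
        result = exc
        ok = False
    elapsed = time.perf_counter() - start
    status = 'ok' if ok else f'failed: {type(result).__name__}'
    print(f'{label}: {status} after {elapsed:.1f}s')
    return ok, elapsed, result


# Usage with the names from the example in this issue (assumed defined):
# probe('info', lambda: abfs_slz.info(str(parquet_file)))
# probe('read', lambda: pd.read_parquet(
#     f'az://{parquet_file}', storage_options=abfs_slz.storage_options))
```

If the `info` probe alone fails right after the write, that would point at the listing call rather than at the parquet reader.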

What you expected to happen: The read executes without exception.

Minimal Complete Verifiable Example:

import os
from pathlib import Path

import adlfs
from dotenv import load_dotenv
import pandas as pd

load_dotenv()

storage_options = dict(
    tenant_id=os.getenv('TENANT_ID'),
    client_id=os.getenv('CLIENT_ID'),
    client_secret=os.getenv('CLIENT_SECRET'))
abfs_slz = adlfs.AzureBlobFileSystem(
    **storage_options, account_name='<account_name>')

adl_folder = Path('<ADL path>')
parquet_file = adl_folder/'test.parquet.brotli'
df = pd.DataFrame(data={'col1': [1, 2], 'col2': ['A', 'b']})

df.to_parquet(
    f'az://{parquet_file}', storage_options=abfs_slz.storage_options,
    compression='BROTLI')

df = pd.read_parquet(
    f'az://{parquet_file}',
    storage_options=abfs_slz.storage_options)
print(df)
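
Since only the first read fails and later reads succeed, a retry around the read might serve as a stopgap until the cause is found. This is only a sketch under that assumption, not a fix. `read_with_retry` and its parameters are hypothetical names, not part of pandas or adlfs.

```python
import time


def read_with_retry(read, attempts=3, delay=5.0, retry_on=(Exception,)):
    """Call read() until it succeeds or the attempts are used up.

    Hypothetical helper for illustration only.
    """
    for attempt in range(1, attempts + 1):
        try:
            return read()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # wait before the next attempt


# Usage with the example above (assumes the same variables are defined):
# from azure.core.exceptions import HttpResponseError
# df = read_with_retry(
#     lambda: pd.read_parquet(
#         f'az://{parquet_file}', storage_options=abfs_slz.storage_options),
#     retry_on=(HttpResponseError,))
```

Restricting `retry_on` to `HttpResponseError` keeps unrelated errors (wrong path, bad credentials) from being retried silently.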

Anything else we need to know?: Nope

Environment:

  • Dask version: 2021.1.1=pyhd3eb1b0_0
  • Python version: 3.8.5=h7579374_1 (pandas: 1.2.1=py38ha9443f7_0)
  • Operating System: Ubuntu 20.04.2 LTS
  • Install method (conda, pip, source): Conda

Conda environment (YAML):

name: py38_ADLS_POC
channels:
  - defaults
  - conda-forge
  - r
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - adal=1.2.6=pyhd3eb1b0_0
  - adlfs=0.6.0=pyhd8ed1ab_0
  - affine=2.3.0=py_0
  - aiohttp=3.6.3=py38h7b6447c_0
  - argon2-cffi=20.1.0=py38h7b6447c_1
  - asciitree=0.3.3=py_2
  - async-timeout=3.0.1=py38h06a4308_0
  - async_generator=1.10=pyhd3eb1b0_0
  - attrs=20.3.0=pyhd3eb1b0_0
  - azure-core=1.10.0=pyhd8ed1ab_0
  - azure-datalake-store=0.0.51=pyh9f0ad1d_0
  - azure-identity=1.5.0=pyhd8ed1ab_0
  - azure-storage-blob=12.6.0=pyhd3deb0d_0
  - backcall=0.2.0=pyhd3eb1b0_0
  - blas=1.0=mkl
  - bleach=3.3.0=pyhd3eb1b0_0
  - blinker=1.4=py38h06a4308_0
  - bokeh=2.2.3=py38_0
  - boost-cpp=1.74.0=h9359b55_0
  - brotlipy=0.7.0=py38h27cfd23_1003
  - bzip2=1.0.8=h7b6447c_0
  - c-ares=1.17.1=h27cfd23_0
  - ca-certificates=2021.1.19=h06a4308_0
  - cairo=1.16.0=h9f066cc_1006
  - certifi=2020.12.5=py38h06a4308_0
  - cffi=1.14.4=py38h261ae71_0
  - cfitsio=3.470=hf0d0db6_6
  - chardet=3.0.4=py38h06a4308_1003
  - click=7.1.2=pyhd3eb1b0_0
  - click-plugins=1.1.1=py_0
  - cligj=0.7.1=py38h06a4308_0
  - cloudpickle=1.6.0=py_0
  - cryptography=3.3.1=py38h3c74f83_0
  - curl=7.71.1=he644dc0_8
  - cycler=0.10.0=py38_0
  - cytoolz=0.11.0=py38h7b6447c_0
  - dask=2021.1.1=pyhd3eb1b0_0
  - dask-core=2021.1.1=pyhd3eb1b0_0
  - dbus=1.13.18=hb2f20db_0
  - decorator=4.4.2=pyhd3eb1b0_0
  - defusedxml=0.6.0=py_0
  - distributed=2021.1.1=py38h06a4308_1
  - entrypoints=0.3=py38_0
  - expat=2.2.10=he6710b0_2
  - fasteners=0.16=pyhd3eb1b0_0
  - fastparquet=0.5.0=py38h6323ea4_1
  - fontconfig=2.13.1=hba837de_1004
  - freetype=2.10.4=h5ab3b9f_0
  - freexl=1.0.6=h27cfd23_0
  - fsspec=0.8.5=pyhd8ed1ab_0
  - geos=3.8.1=he6710b0_0
  - geotiff=1.6.0=h5d11630_3
  - gettext=0.19.8.1=h9b4dc7a_1
  - giflib=5.2.1=h7b6447c_0
  - glib=2.66.4=hc4f0c31_2
  - glib-tools=2.66.4=hc4f0c31_2
  - gst-plugins-base=1.14.5=h0935bb2_2
  - gstreamer=1.18.3=h3560a44_0
  - hdf4=4.2.13=h3ca952b_2
  - hdf5=1.10.6=nompi_h3c11f04_101
  - heapdict=1.0.1=py_0
  - icu=67.1=he1b5a44_0
  - idna=2.10=pyhd3eb1b0_0
  - imageio=2.9.0=py_0
  - importlib-metadata=2.0.0=py_1
  - importlib_metadata=2.0.0=1
  - intel-openmp=2020.2=254
  - ipykernel=5.3.4=py38h5ca1d4c_0
  - ipython=7.20.0=py38hb070fc8_1
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - isodate=0.6.0=py_1
  - jedi=0.17.2=py38h06a4308_1
  - jinja2=2.11.3=pyhd3eb1b0_0
  - jpeg=9d=h36c2ea0_0
  - json-c=0.13.1=h1bed415_0
  - jsonschema=3.2.0=py_2
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.1=py38h06a4308_0
  - jupyterlab_pygments=0.1.2=py_0
  - kealib=1.4.14=h0042707_0
  - kiwisolver=1.3.1=py38h2531618_0
  - krb5=1.17.1=h173b8e3_0
  - lcms2=2.11=h396b838_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libclang=11.0.1=default_ha53f305_1
  - libcurl=7.71.1=hcdd3856_8
  - libdap4=3.20.6=h1d1bd15_1
  - libedit=3.1.20191231=h14c3975_1
  - libev=4.33=h7b6447c_0
  - libevent=2.1.10=hcdb4288_3
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.3.0=h2828fa1_18
  - libgdal=3.1.4=h670eac6_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libglib=2.66.4=h748fe8e_2
  - libgomp=9.3.0=h2828fa1_18
  - libiconv=1.16=h516909a_0
  - libkml=1.3.0=h74f7ee3_1012
  - libllvm10=10.0.1=hbcb73fb_5
  - libllvm11=11.0.1=hf817b99_0
  - libnetcdf=4.7.4=nompi_h56d31a8_107
  - libnghttp2=1.41.0=hf8bcb03_2
  - libpng=1.6.37=hbc83047_0
  - libpq=12.3=h255efa7_3
  - libsodium=1.0.18=h7b6447c_0
  - libspatialite=5.0.0=heaf302f_0
  - libssh2=1.9.0=h1ba5d50_1
  - libstdcxx-ng=9.3.0=h6de172a_18
  - libtiff=4.1.0=h2733197_1
  - libuuid=2.32.1=h7f98852_1000
  - libwebp-base=1.2.0=h27cfd23_0
  - libxcb=1.14=h7b6447c_0
  - libxkbcommon=1.0.3=he3ba5ed_0
  - libxml2=2.9.10=h68273f3_2
  - llvmlite=0.34.0=py38h269e1b5_4
  - locket=0.2.1=py38h06a4308_1
  - lz4-c=1.9.2=heb0550a_3
  - markupsafe=1.1.1=py38h7b6447c_0
  - matplotlib=3.3.2=h06a4308_0
  - matplotlib-base=3.3.2=py38h817c723_0
  - mistune=0.8.4=py38h7b6447c_1000
  - mkl=2020.2=256
  - mkl-service=2.3.0=py38he904b0f_0
  - mkl_fft=1.2.0=py38h23d657b_0
  - mkl_random=1.1.1=py38h0573a6f_0
  - monotonic=1.5=py_0
  - msal=1.8.0=pyhd3deb0d_0
  - msal_extensions=0.3.0=pyh9f0ad1d_0
  - msgpack-python=1.0.2=py38hff7bd54_1
  - msrest=0.6.21=pyh44b312d_0
  - msrestazure=0.6.4=pyhd8ed1ab_0
  - multidict=4.7.6=py38h7b6447c_1
  - mysql-common=8.0.22=ha770c72_1
  - mysql-libs=8.0.22=h1fd7589_1
  - nb_conda_kernels=2.3.1=py38h06a4308_0
  - nbclient=0.5.1=py_0
  - nbconvert=6.0.7=py38_0
  - nbformat=5.1.2=pyhd3eb1b0_1
  - ncurses=6.2=he6710b0_1
  - nest-asyncio=1.4.3=pyhd3eb1b0_0
  - networkx=2.5=py_0
  - notebook=6.2.0=py38h06a4308_0
  - nspr=4.29=h9c3ff4c_1
  - nss=3.61=hb5efdd6_0
  - numba=0.51.2=py38h0573a6f_1
  - numcodecs=0.7.3=py38h2531618_0
  - numpy=1.19.2=py38h54aff64_0
  - numpy-base=1.19.2=py38hfa32c7d_0
  - oauthlib=3.1.0=py_0
  - olefile=0.46=py_0
  - openjpeg=2.3.1=hf7af979_3
  - openssl=1.1.1i=h27cfd23_0
  - packaging=20.9=pyhd3eb1b0_0
  - pandas=1.2.1=py38ha9443f7_0
  - pandoc=2.11=hb0f4dca_0
  - pandocfilters=1.4.3=py38h06a4308_1
  - parso=0.7.0=py_0
  - partd=1.1.0=py_0
  - pcre=8.44=he6710b0_0
  - pexpect=4.8.0=pyhd3eb1b0_3
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pillow=8.1.0=py38he98fc37_0
  - pip=20.3.3=py38h06a4308_0
  - pixman=0.40.0=h7b6447c_0
  - poppler=0.89.0=h669c267_1
  - poppler-data=0.4.10=h06a4308_0
  - portalocker=2.2.0=py38h06a4308_0
  - postgresql=12.3=hc2f5b80_3
  - proj=7.1.1=h966b41f_3
  - prometheus_client=0.9.0=pyhd3eb1b0_0
  - prompt-toolkit=3.0.8=py_0
  - psutil=5.8.0=py38h27cfd23_1
  - ptyprocess=0.7.0=pyhd3eb1b0_2
  - pycparser=2.20=py_2
  - pygments=2.7.4=pyhd3eb1b0_0
  - pyjwt=1.7.1=py38_0
  - pyopenssl=20.0.1=pyhd3eb1b0_1
  - pyparsing=2.4.7=pyhd3eb1b0_0
  - pyproj=2.6.1.post1=py38h56787f0_3
  - pyqt=5.12.3=py38h578d9bd_7
  - pyqt-impl=5.12.3=py38h7400c14_7
  - pyqt5-sip=4.19.18=py38h709712a_7
  - pyqtchart=5.12=py38h7400c14_7
  - pyqtwebengine=5.12.1=py38h7400c14_7
  - pyrsistent=0.17.3=py38h7b6447c_0
  - pysocks=1.7.1=py38h06a4308_0
  - python=3.8.5=h7579374_1
  - python-dateutil=2.8.1=pyhd3eb1b0_0
  - python-dotenv=0.15.0=pyhd8ed1ab_0
  - python_abi=3.8=1_cp38
  - pytz=2021.1=pyhd3eb1b0_0
  - pywavelets=1.1.1=py38h7b6447c_2
  - pyyaml=5.4.1=py38h27cfd23_1
  - pyzmq=20.0.0=py38h2531618_1
  - qt=5.12.9=h763d07f_1
  - rasterio=1.2.0=py38h033aa8a_0
  - readline=8.1=h27cfd23_0
  - requests=2.25.1=pyhd3eb1b0_0
  - requests-oauthlib=1.3.0=py_0
  - rioxarray=0.2.0=pyhd8ed1ab_0
  - scikit-image=0.17.2=py38hdf5156a_0
  - scipy=1.5.2=py38h0b6359f_0
  - send2trash=1.5.0=pyhd3eb1b0_1
  - setuptools=52.0.0=py38h06a4308_0
  - shapely=1.7.1=py38ha11d057_1
  - six=1.15.0=py38h06a4308_0
  - snuggs=1.4.7=py_0
  - sortedcontainers=2.3.0=pyhd3eb1b0_0
  - sqlite=3.34.0=h74cdb3f_0
  - tbb=2020.3=hfd86e86_0
  - tblib=1.7.0=py_0
  - terminado=0.9.2=py38h06a4308_0
  - testpath=0.4.4=pyhd3eb1b0_0
  - thrift=0.11.0=py38he6710b0_0
  - tifffile=2020.10.1=py38hdd07704_2
  - tiledb=2.1.5=h17508cd_0
  - tk=8.6.10=hbc83047_0
  - toolz=0.11.1=pyhd3eb1b0_0
  - tornado=6.1=py38h27cfd23_0
  - traitlets=5.0.5=pyhd3eb1b0_0
  - typing_extensions=3.7.4.3=pyh06a4308_0
  - tzcode=2021a=h7f98852_0
  - urllib3=1.26.3=pyhd3eb1b0_0
  - wcwidth=0.2.5=py_0
  - webencodings=0.5.1=py38_1
  - wheel=0.36.2=pyhd3eb1b0_0
  - xarray=0.16.2=pyhd3eb1b0_0
  - xerces-c=3.2.3=hfe33f54_1
  - xorg-kbproto=1.0.7=h7f98852_1002
  - xorg-libice=1.0.10=h516909a_0
  - xorg-libsm=1.2.3=h84519dc_1000
  - xorg-libx11=1.6.12=h516909a_0
  - xorg-libxext=1.3.4=h516909a_0
  - xorg-libxrender=0.9.10=h516909a_1002
  - xorg-renderproto=0.11.1=h14c3975_1002
  - xorg-xextproto=7.3.0=h7f98852_1002
  - xorg-xproto=7.0.31=h27cfd23_1007
  - xz=5.2.5=h7b6447c_0
  - yaml=0.2.5=h7b6447c_0
  - yarl=1.6.3=py38h27cfd23_0
  - zarr=2.6.1=pyhd3eb1b0_0
  - zeromq=4.3.3=he6710b0_3
  - zict=2.0.0=py_0
  - zipp=3.4.0=pyhd3eb1b0_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.5=h9ceee32_0
  - pip:
    - osgeo==0.0.0
    - pygdal==3.1.4.6
prefix: /home/donj/miniconda3/envs/py38_ADLS_POC

PeterFogh, May 18 '21 08:05