
PyDeltaTableError: Generic S3 error: Error performing get request

shazamkash opened this issue 2 years ago · 1 comment

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

I get the error shown below when I try to use Polars to read data from Delta Lake. My Delta Lake storage is non-AWS (Ceph-based) S3.

The Parquet file is about 1 GB compressed and 3 GB uncompressed. The table was written to Delta Lake using the delta-rs Python binding.

Environment:

  • Delta-rs version: 0.8.1
  • Binding: Python
  • Docker container: Python 3.10.10, Debian GNU/Linux 11 (bullseye)
  • S3: non-AWS (Ceph-based)

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 pl_data = pl.read_delta(source=table_uri, storage_options=storage_options)
      2 print(pl_data)

File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:136, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    134 if len(args) > num_allowed_args:
    135     warnings.warn(msg, DeprecationWarning, stacklevel=stacklevel)
--> 136 return function(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/polars/utils/decorators.py:37, in deprecated_alias.<locals>.deco.<locals>.wrapper(*args, **kwargs)
     34 @functools.wraps(function)
     35 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     36     _rename_kwargs(function.__name__, kwargs, aliases, stacklevel=stacklevel)
---> 37     return function(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/polars/io/delta.py:141, in read_delta(source, version, columns, storage_options, delta_table_options, pyarrow_options)
    132 resolved_uri = _resolve_delta_lake_uri(source)
    134 dl_tbl = _get_delta_lake_table(
    135     table_path=resolved_uri,
    136     version=version,
    137     storage_options=storage_options,
    138     delta_table_options=delta_table_options,
    139 )
--> 141 return from_arrow(dl_tbl.to_pyarrow_table(columns=columns, **pyarrow_options))

File /opt/conda/lib/python3.10/site-packages/deltalake/table.py:400, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    386 def to_pyarrow_table(
    387     self,
    388     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    389     columns: Optional[List[str]] = None,
    390     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    391 ) -> pyarrow.Table:
    392     """
    393     Build a PyArrow Table using data from the DeltaTable.
    394 
   (...)
    398     :return: the PyArrow table
    399     """
--> 400     return self.to_pyarrow_dataset(
    401         partitions=partitions, filesystem=filesystem
    402     ).to_table(columns=columns)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

PyDeltaTableError: Generic S3 error: Error performing get request xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet: response error "<html><body><h1>429 Too Many Requests</h1>
You have sent too many requests in a given amount of time.
</body></html>
", after 0 retries: HTTP status client error (429 Too Many Requests) for url (https://xxx.yyy.zzz.net/delta-lake-bronze/xxx/yyy/data_3_gb/0-ccc89437-58a8-44a4-aad2-17ffce7dd929-0.parquet)

Reading the table works fine when the data is small, for example a few tens of MB; the problem only seems to occur with larger data. I get the same error when reading the data directly with delta-rs's to_pandas() and to_pyarrow_dataset() functions.

I have opened the same issue on delta-rs, but have had no help so far: https://github.com/delta-io/delta-rs/issues/1256
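
Since the failure is a 429 rate limit hit "after 0 retries", one possible workaround, a sketch only and not verified here, is to read through delta-rs directly and throttle the PyArrow scan so fewer GET requests are in flight at once. use_threads=False and the readahead values below are assumptions, not confirmed fixes; table_path and storage_options are the same as in the reproducible example below.

import polars as pl
from deltalake import DeltaTable

# Build the dataset via delta-rs, then scan it single-threaded with minimal
# readahead to reduce concurrent GET requests against the Ceph endpoint.
dt = DeltaTable(table_path, storage_options=storage_options)
ds = dt.to_pyarrow_dataset()
tbl = ds.scanner(
    use_threads=False,       # assumption: serial scan avoids the request burst
    batch_readahead=1,       # assumption: minimal prefetching
    fragment_readahead=1,
).to_table()
pl_data = pl.from_arrow(tbl)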

Reproducible example

import polars as pl

# `credentials` is provided by the surrounding environment (not shown here).
storage_options = {
    "AWS_ACCESS_KEY_ID": credentials.access_key,
    "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    "AWS_ENDPOINT_URL": "https://xxx.yyy.zzz.net",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "True",
}

table_path = "s3a://delta-lake-bronze/xxx/yyy/data_3_gb"
pl_data = pl.read_delta(source=table_path, storage_options=storage_options)
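
An alternative sketch, also unverified, is to hand PyArrow's own S3 client to pl.read_delta through pyarrow_options, since the traceback shows that to_pyarrow_table() accepts a filesystem argument. Rooting the filesystem at the table path with SubTreeFileSystem is an assumption about how delta-rs resolves the relative data-file paths.

import pyarrow.fs as pa_fs

# PyArrow's S3 client pointed at the Ceph endpoint; storage_options is still
# passed so delta-rs can read the transaction log itself.
s3 = pa_fs.S3FileSystem(
    access_key=credentials.access_key,
    secret_key=credentials.secret_key,
    endpoint_override="https://xxx.yyy.zzz.net",
)
# Assumption: the filesystem must be rooted at the table path.
fs = pa_fs.SubTreeFileSystem("delta-lake-bronze/xxx/yyy/data_3_gb", s3)

pl_data = pl.read_delta(
    source=table_path,
    storage_options=storage_options,
    pyarrow_options={"filesystem": fs},
)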

Expected behavior

I expect the data to be read from Delta Lake into a DataFrame. I am able to read the same data with PySpark, which confirms that nothing is wrong with my Delta table.
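
For reference, a minimal sketch of the kind of PySpark read that works; it assumes the delta-spark and hadoop-aws packages are on the classpath and that these fs.s3a.* settings are how the endpoint is configured, which may differ in your setup.

from pyspark.sql import SparkSession

# Assumed session config for a non-AWS S3 endpoint; `credentials` as above.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://xxx.yyy.zzz.net")
    .config("spark.hadoop.fs.s3a.access.key", credentials.access_key)
    .config("spark.hadoop.fs.s3a.secret.key", credentials.secret_key)
    .getOrCreate()
)
df = spark.read.format("delta").load("s3a://delta-lake-bronze/xxx/yyy/data_3_gb")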

Installed versions

---Version info---
Polars: 0.16.18
Index type: UInt32
Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.35
Python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
---Optional dependencies---
numpy: 1.23.5
pandas: 2.0.0
pyarrow: 11.0.0
connectorx: <not installed>
deltalake: 0.8.1
fsspec: 2023.3.0
matplotlib: 3.7.1
xlsx2csv: <not installed>
xlsxwriter: <not installed>

shazamkash · Apr 05 '23 12:04

This needs to be tracked on the delta-rs side. Thanks for raising the ticket.

chitralverma · May 07 '23 07:05