Error when reading a remote parquet
What happens?
I get an error when reading a remote parquet with Python that I don't get with the CLI for the same request.
To Reproduce
I am trying to read a remote parquet file. I am attempting it with both the CLI and the Python package, both on the latest version 1.2.0.
When I do it from the CLI, it works fine:
D select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;
┌──────────┬──────────────────────────┬──────────┬─────────────────────────────────────────────┬───────────────────────────────────────────────────────────┐
│ dep_code │ dep_nom │ reg_code │ reg_nom │ aca_nom │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├──────────┼──────────────────────────┼──────────┼─────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│ 84 │ Vaucluse │ 93 │ Provence-Alpes-Côte d'Azur │ Académie d'Aix-Marseille │
│ 2 │ Aisne │ 32 │ Hauts-de-France │ Académie d'Amiens │
│ 24 │ Dordogne │ 75 │ Nouvelle-Aquitaine │ Académie de Bordeaux │
However, with the Python package, I get this error:
>>> duckdb.sql("select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
duckdb.duckdb.IOException: IO Error: Server sent back more data than expected, `SET force_download=true` might help in this case
I understand that I can work around this with SET force_download=true, but I don't understand why this is needed in Python and not in the CLI. What is the difference?
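For reference, a minimal sketch of that workaround from Python. It assumes the httpfs extension is available and that `con` is a connection created with `duckdb.connect()`; the helper name `read_remote_parquet` is hypothetical and not part of DuckDB.

```python
def read_remote_parquet(con, url):
    """Enable force_download on `con`, then read the Parquet file at `url`."""
    # Setting named in the error message above; with it enabled the file is
    # downloaded in full instead of being read through Range requests.
    con.execute("SET force_download=true")
    # Double any single quotes so the URL stays a valid SQL string literal.
    quoted = url.replace("'", "''")
    return con.execute(f"SELECT * FROM read_parquet('{quoted}')")


# Usage (needs the duckdb package and network access):
#   import duckdb
#   con = duckdb.connect()
#   rows = read_remote_parquet(con, "https://...").fetchall()
```

This only applies the workaround; it does not explain the difference between the two clients.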
OS:
Kubuntu 24.04
DuckDB Version:
1.2.0
DuckDB Client:
Python and CLI
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
- [x] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [x] Yes, I have
I can only reproduce it using the Python interpreter in interactive mode.
Does it also happen if you run it as a Python script?
import duckdb
duckdb.sql("select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;")
and then
python myscript.py
When running in interactive mode, the file handle is no longer cached, so DuckDB starts a Range request that fails, even though the cache was previously filled for the same path name by a full download. When running from the native DuckDB client or from a Python script, the Parquet read operates on the cache, because a full download was performed earlier.
When the file is read and is not in the cache (it should be, because the full download was invoked properly), DuckDB initiates a Range request. That seems to lead to a different issue, because the remote server does not indicate Range request support in the headers of its HEAD response.
@florentfougeres can you confirm whether it happens in interactive mode only?
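To see what a server advertises in its HEAD response, something like the sketch below may help. It uses only the Python standard library, and the function names are hypothetical. It checks for the standard `Accept-Ranges: bytes` header. How DuckDB itself decides whether to use Range requests is not covered here, and a server can honor Range requests without sending this header.

```python
import urllib.request


def advertises_range_support(headers):
    """Return True if the headers contain 'Accept-Ranges: bytes'."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "accept-ranges":
            return value.strip().lower() == "bytes"
    return False


def head_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return dict(resp.headers.items())


# Usage (needs network access):
#   print(advertises_range_support(head_headers("https://...")))
```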
Don't know how useful this is, but I ran into this problem when trying to use a roll-your-own mock implementation of an s3 server I use for testing. That implementation was missing two things:
- Support for http Range requests (almost a must-have for duckdb I'd say, especially when working with parquet files).
- Support for http HEAD requests
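In case it helps anyone with a similar test setup, here is a rough sketch of a mock server that answers both HEAD and single-range GET requests, using only the Python standard library. `PAYLOAD`, `RangeHandler`, and `start_server` are made-up names for illustration. It serves the same bytes for every path and is not an S3 implementation.

```python
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = bytes(range(256)) * 4  # stand-in for the bytes of a real file


class RangeHandler(BaseHTTPRequestHandler):
    def _send_headers(self, status, body_len, content_range=None):
        self.send_response(status)
        self.send_header("Accept-Ranges", "bytes")  # advertise Range support
        self.send_header("Content-Length", str(body_len))
        if content_range:
            self.send_header("Content-Range", content_range)
        self.end_headers()

    def do_HEAD(self):
        # Same headers as a full GET, but no body.
        self._send_headers(200, len(PAYLOAD))

    def do_GET(self):
        # Only "bytes=start-end" and "bytes=start-" are handled; anything
        # else (including suffix ranges) falls back to the full body.
        match = re.fullmatch(r"bytes=(\d+)-(\d*)", self.headers.get("Range", ""))
        if not match:
            self._send_headers(200, len(PAYLOAD))
            self.wfile.write(PAYLOAD)
            return
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else len(PAYLOAD) - 1
        end = min(end, len(PAYLOAD) - 1)
        if start > end:
            self._send_headers(416, 0, f"bytes */{len(PAYLOAD)}")
            return
        chunk = PAYLOAD[start:end + 1]
        self._send_headers(206, len(chunk), f"bytes {start}-{end}/{len(PAYLOAD)}")
        self.wfile.write(chunk)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_server():
    """Serve on a free localhost port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), RangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The port chosen by the OS is available as `server.server_address[1]`, and `server.shutdown()` stops the server.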