Error when reading a remote parquet

Open florentfougeres opened this issue 10 months ago • 3 comments

What happens?

I get an error when reading a remote parquet with Python that I don't get with the CLI for the same request.

To Reproduce

I am trying to read a remote parquet file. I am attempting it with both the CLI and the Python package, both on the latest version 1.2.0.

When I do it from the CLI, it works fine:

D select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;
┌──────────┬──────────────────────────┬──────────┬─────────────────────────────────────────────┬───────────────────────────────────────────────────────────┐
│ dep_code │         dep_nom          │ reg_code │                   reg_nom                   │                          aca_nom                          │
│ varchar  │         varchar          │ varchar  │                   varchar                   │                          varchar                          │
├──────────┼──────────────────────────┼──────────┼─────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│ 84       │ Vaucluse                 │ 93       │ Provence-Alpes-Côte d'Azur                  │ Académie d'Aix-Marseille                                  │
│ 2        │ Aisne                    │ 32       │ Hauts-de-France                             │ Académie d'Amiens                                         │
│ 24       │ Dordogne                 │ 75       │ Nouvelle-Aquitaine                          │ Académie de Bordeaux

However, with the Python package, I get this error:

>>> duckdb.sql("select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.IOException: IO Error: Server sent back more data than expected, `SET force_download=true` might help in this case

I understand that I have to run SET force_download=true, but I don't understand why this is needed in Python and not in the CLI. What is the difference?
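For reference, the workaround suggested by the error message could be applied from Python roughly as follows. This is a minimal sketch, not a confirmed fix for the interactive-mode behaviour: the helper name `read_with_force_download` is hypothetical, and it assumes the `httpfs` extension can be installed and that the remote server is reachable.

```python
# URL taken from the report above.
URL = (
    "https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/"
    "insee-departements@equipements-sgsocialgouv/exports/parquet"
    "?lang=fr&timezone=Europe%2FBerlin"
)


def read_with_force_download(url):
    """Hypothetical helper: read a remote Parquet file with force_download on."""
    # Imported here so that defining the helper does not require duckdb.
    import duckdb

    con = duckdb.connect()  # in-memory database
    # Load httpfs explicitly so the force_download setting is known.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Ask httpfs to download the whole file instead of issuing Range requests.
    con.execute("SET force_download = true")
    # Bind the URL as a parameter rather than formatting it into the SQL text.
    return con.execute("SELECT * FROM read_parquet(?)", [url]).fetchall()
```

Calling `read_with_force_download(URL)` returns the rows as a list of tuples. The setting is applied on a dedicated connection here, so it does not affect the default connection used by `duckdb.sql`.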

OS:

Kubuntu 24.04

DuckDB Version:

1.2.0

DuckDB Client:

Python and CLI

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • [x] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [x] Yes, I have

florentfougeres, Feb 24 '25 14:02

I can only reproduce it using the Python interpreter in interactive mode.

Does it happen if you create a Python script?

import duckdb

duckdb.sql("select * from read_parquet('https://data.opendatasoft.com/api/explore/v2.1/catalog/datasets/insee-departements@equipements-sgsocialgouv/exports/parquet?lang=fr&timezone=Europe%2FBerlin') ;")

and then

python myscript.py

When running in interactive mode, the file handle is no longer cached, so the read starts a Range request, which fails (even though the cache was previously filled for the same path name by a full download). When running from the native DuckDB client, or in script mode in Python, the Parquet read operates on the cache, because a full download was performed beforehand.

When the file is read and is not in the cache (it should be, because the full download was invoked properly), a Range request is initiated. That seems to lead to a different issue, because the remote server does not indicate Range request support in its HEAD response headers.
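To check what a server advertises, one can issue a HEAD request and look for `Accept-Ranges: bytes`. The sketch below uses only the Python standard library; the function names are hypothetical, and it is an assumption that this header is what matters here. A server may also honour Range requests without advertising them, so a negative result is only a hint.

```python
import urllib.request


def advertises_range_support(headers):
    """Return True if the headers contain 'Accept-Ranges: bytes'.

    `headers` is any mapping of header name to value; header names are
    compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "accept-ranges":
            units = [unit.strip().lower() for unit in value.split(",")]
            return "bytes" in units
    return False


def head_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

For the URL in this report, `advertises_range_support(head_headers(url))` would show whether the export endpoint declares byte-range support.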

lcostantino, Mar 04 '25 15:03

@florentfougeres can you confirm whether it happens in interactive mode only?

lcostantino, Mar 16 '25 20:03

Don't know how useful this is, but I ran into this problem when trying to use a roll-your-own mock implementation of an S3 server that I use for testing. That implementation was missing two things:

  1. Support for HTTP Range requests (almost a must-have for DuckDB I'd say, especially when working with Parquet files).
  2. Support for HTTP HEAD requests
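As an illustration of those two features, a test server could look roughly like this. It is a standard-library sketch, not the mock S3 implementation mentioned above: `PAYLOAD`, `RangeHandler`, `start_server` and `fetch` are hypothetical names, and only single ranges of the form `bytes=start-end` or `bytes=start-` are handled. Any other Range header is ignored and answered with the full body.

```python
import http.client
import re
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical stand-in for the bytes of a Parquet file.
PAYLOAD = bytes(range(256)) * 4


class RangeHandler(BaseHTTPRequestHandler):
    """Serves PAYLOAD on every path, with HEAD and byte-range support."""

    def _respond(self, include_body):
        total = len(PAYLOAD)
        match = re.fullmatch(r"bytes=(\d+)-(\d*)", self.headers.get("Range", ""))
        if match:
            start = int(match.group(1))
            end = int(match.group(2)) if match.group(2) else total - 1
            end = min(end, total - 1)
            if start > end:
                # Requested range lies outside the payload.
                self.send_response(416)
                self.send_header("Content-Range", f"bytes */{total}")
                self.send_header("Content-Length", "0")
                self.end_headers()
                return
            body = PAYLOAD[start:end + 1]
            self.send_response(206)
            self.send_header("Content-Range", f"bytes {start}-{end}/{total}")
        else:
            body = PAYLOAD
            self.send_response(200)
        # Advertise range support on every successful response, HEAD included.
        self.send_header("Accept-Ranges", "bytes")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._respond(include_body=True)

    def do_HEAD(self):
        # Same headers as GET, but no body.
        self._respond(include_body=False)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_server():
    """Start the server on a free local port in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, method, headers=None):
    """Small client helper: return (status, headers, body) for one request."""
    conn = http.client.HTTPConnection(*server.server_address)
    conn.request(method, "/data.parquet", headers=headers or {})
    response = conn.getresponse()
    body = response.read()
    conn.close()
    return response.status, dict(response.getheaders()), body
```

With a server from `start_server()`, `fetch(server, "HEAD")` returns the headers without a body, and `fetch(server, "GET", {"Range": "bytes=10-19"})` returns a 206 response carrying ten bytes. Call `server.shutdown()` when done.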

J-Zeitler, Dec 02 '25 14:12