duckdb
Reading CSV from URL throws 400 error
Discussed in https://github.com/duckdb/duckdb_spatial/discussions/235
Originally posted by BruceHarold January 22, 2024 Hi Ducklings
I'm (attempting) to read_csv_auto from a URL: https://data.bloomington.in.gov/resource/aw6y-t4ix.csv I have spatial and httpfs extensions installed and loaded. I get:
IOException: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)
I'm a newbie so must be missing something basic, anyone have some tips? Thanks.
I can reproduce the problem, but I'm not sure what's happening. For the devs, here is a reproducer:
import duckdb
con = duckdb.connect()
sql = "SELECT * from st_read('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv')"
ext = "load spatial; load httpfs"
con.execute(ext)
# This fails
t = con.execute(sql)
---------------------------------------------------------------------------
IOException Traceback (most recent call last)
Cell In[7], line 1
----> 1 t = con.execute(sql)
IOException: IO Error: GDAL Error (4): Failed to open file https://data.bloomington.in.gov/resource/aw6y-t4ix.csv: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)
I also tried read_csv and hit a similar issue:
sql2 = "SELECT * from read_csv('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv', AUTO_DETECT=TRUE)"
con.execute(sql2)
---------------------------------------------------------------------------
IOException Traceback (most recent call last)
Cell In[11], line 1
----> 1 con.execute(sql2)
IOException: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)
Note that if you open the link in a browser the data downloads fine, so I'm not sure what is happening.
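Since a plain browser download works, one possible workaround is to fetch the whole file first and point DuckDB at the local copy. The following is only a sketch under that assumption: `download_to_temp` and `read_remote_csv` are hypothetical helper names, and I have not confirmed that this particular server accepts a request from Python's urllib the way it accepts one from a browser.

```python
import os
import shutil
import tempfile
import urllib.request


def download_to_temp(url: str, suffix: str = ".csv") -> str:
    """Fetch the whole resource with a plain GET (no Range header) into a temp file.

    Hypothetical helper, not part of DuckDB. Returns the path of the temp file.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
        shutil.copyfileobj(resp, out)
    return path


def read_remote_csv(url: str):
    """Download first, then let DuckDB read the local copy (hypothetical helper)."""
    import duckdb  # imported here so the download helper works without duckdb

    path = download_to_temp(url)
    try:
        con = duckdb.connect()
        # Materialize the rows before the temp file is removed.
        return con.read_csv(path).fetchall()
    finally:
        os.remove(path)
```

Usage would be `rows = read_remote_csv('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv')`. This sidesteps httpfs entirely, at the cost of downloading the full file up front.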
Hi!
This issue is not related to duckdb_spatial: read_csv_auto is part of core DuckDB (and should be preferred over st_read when reading CSV). After debugging a bit, it seems the server does not handle HTTP range requests, which DuckDB's httpfs extension uses to incrementally read parts of the data.
In response to a range request the server should return HTTP 206 (Partial Content), but it returns 200 instead. Even if we accepted the 200, the server does not return the required Content-Length header.
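For anyone who wants to check a server themselves, here is a rough sketch of the diagnosis above using only Python's standard library. `classify_range_response` and `probe` are hypothetical names, and the probe sends a single one-byte range request; it does not reproduce the exact sequence of requests that httpfs issues.

```python
import urllib.error
import urllib.request


def classify_range_response(status: int, header_names) -> str:
    """Interpret the reply to a request that carried a Range header."""
    names = {name.lower() for name in header_names}
    if status == 206:
        return "range honored (206 Partial Content)"
    if status == 200:
        if "content-length" in names:
            return "range ignored (200), Content-Length present"
        return "range ignored (200), no Content-Length"
    return f"unexpected status {status}"


def probe(url: str) -> str:
    """Ask for the first byte only and report how the server reacted."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(req) as resp:
            return classify_range_response(resp.status, resp.headers.keys())
    except urllib.error.HTTPError as err:
        # 4xx/5xx replies are raised as exceptions; classify them as well.
        return classify_range_response(err.code, err.headers.keys())
```

Calling `probe('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv')` prints nothing by itself; print the returned string to see which case the server falls into.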
@szarnyasg Maybe we can move this to the core DuckDB issue tracker?
Apologies I put it in the spatial repo in the first place ;-).
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stale for 30 days with no activity.