
Reading CSV from URL throws 400 error

Open BruceHarold opened this issue 1 year ago • 5 comments

Discussed in https://github.com/duckdb/duckdb_spatial/discussions/235

Originally posted by BruceHarold January 22, 2024

Hi Ducklings

I'm attempting to use read_csv_auto to read from a URL: https://data.bloomington.in.gov/resource/aw6y-t4ix.csv I have the spatial and httpfs extensions installed and loaded. I get:

IOException: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)

I'm a newbie so must be missing something basic, anyone have some tips? Thanks.

BruceHarold avatar Jan 23 '24 14:01 BruceHarold

I can reproduce the problem, not sure what's happening. But for the devs, here is a reproducer

import duckdb

con = duckdb.connect()

sql = "SELECT * from st_read('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv')"
ext = "load spatial; load httpfs"

# Load the spatial and httpfs extensions (both must already be installed)
con.execute(ext)

# This fails
t = con.execute(sql)
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[7], line 1
----> 1 t = con.execute(sql)

IOException: IO Error: GDAL Error (4): Failed to open file https://data.bloomington.in.gov/resource/aw6y-t4ix.csv: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)

I also tried read_csv and hit a similar issue:

sql2 = "SELECT * from read_csv('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv', AUTO_DETECT=TRUE)"

con.execute(sql2)
---------------------------------------------------------------------------
IOException                               Traceback (most recent call last)
Cell In[11], line 1
----> 1 con.execute(sql2)

IOException: IO Error: Unable to connect to URL "https://data.bloomington.in.gov/resource/aw6y-t4ix.csv": 400 (Bad Request)

Note that you can open the link in a browser and the data will download, so I'm not sure what is happening.
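Since a plain download of the link works, one possible workaround is to fetch the whole file first and point DuckDB at the local copy. The following is only a sketch under that assumption: the helper names `download_to_temp` and `read_remote_csv` are hypothetical, and the server may treat a script's request differently from a browser's.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_to_temp(url: str, suffix: str = ".csv") -> Path:
    """Fetch the whole resource with a single plain GET and save it to a temp file.

    Hypothetical helper: no Range header is sent, so the server only has to
    support an ordinary full download.
    """
    with urllib.request.urlopen(url, timeout=60) as resp:
        data = resp.read()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return Path(tmp.name)


def read_remote_csv(url: str):
    """Download the CSV, then let DuckDB read the local copy (hypothetical helper)."""
    import duckdb  # imported lazily so download_to_temp works without duckdb

    local_path = download_to_temp(url)
    # Reading a local file does not go through httpfs at all.
    return duckdb.read_csv(str(local_path)).fetchall()
```

Calling `read_remote_csv('https://data.bloomington.in.gov/resource/aw6y-t4ix.csv')` would then load the file in one request, at the cost of holding the whole download in memory before it is written to disk.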

ncclementi avatar Jan 23 '24 14:01 ncclementi

Hi! This issue is not related to duckdb_spatial; read_csv_auto is part of core DuckDB (and should be preferred over st_read when reading CSV). After debugging a bit, it seems the server does not handle HTTP range requests, which DuckDB's httpfs extension uses to incrementally read parts of the data.

It should return an HTTP 206 code, but returns 200 instead. Even if we accept a 200, they don't return the required Content-Length header.
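To check a server for this behaviour yourself, a small probe along these lines may help. It is a rough sketch, not how httpfs does it internally: the function names `classify_range_response` and `probe_range_support` are made up for illustration, and the probe simply asks for the first byte with a `Range` header and looks at the status code and the `Content-Length` header.

```python
import urllib.request


def classify_range_response(status: int, headers: dict) -> str:
    """Classify the response to a `Range: bytes=0-0` request (hypothetical helper)."""
    names = {name.lower() for name in headers}
    if status == 206:
        return "range requests supported"
    if status == 200 and "content-length" in names:
        return "range ignored, full body with Content-Length"
    if status == 200:
        return "range ignored, full body without Content-Length"
    return f"unexpected status {status}"


def probe_range_support(url: str) -> str:
    """Request only the first byte and report how the server answered.

    Note: urllib raises urllib.error.HTTPError for 4xx/5xx responses,
    so a rejected request surfaces as an exception rather than a string.
    """
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return classify_range_response(resp.status, dict(resp.headers.items()))
```

A server that honours range requests should come back as "range requests supported"; the behaviour described above would correspond to the "range ignored, full body without Content-Length" case.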

Maxxen avatar Jan 24 '24 18:01 Maxxen

@szarnyasg Maybe we can move this to core duckdb issue tracker?

Maxxen avatar Jan 24 '24 18:01 Maxxen

@szarnyasg Maybe we can move this to core duckdb issue tracker?

Apologies for putting it in the spatial repo in the first place ;-).

BruceHarold avatar Jan 24 '24 18:01 BruceHarold

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Jul 02 '24 00:07 github-actions[bot]

This issue was closed because it has been stale for 30 days with no activity.

github-actions[bot] avatar Aug 01 '24 00:08 github-actions[bot]