Error while reading again a parquet file after browser reload

Open ericemc3 opened this issue 11 months ago • 28 comments

What happens?

Executing the same query a second time, after reloading the shell page, yields an error.

To Reproduce

In https://shell.duckdb.org/, execute:

FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' 
SELECT codgeo WHERE epci = '200039865' ;

then reload the browser and execute that same query again.

On Windows, with Chrome or Edge, I get: Invalid Error: TProtocolException: Invalid data

The codgeo column, which is also the first column of the dataset, seems to be responsible.

With Firefox, no issue.

OS:

Win11

DuckDB Version:

0.10.0

DuckDB Client:

shell wasm1.28.1-dev159.0

Full Name:

eric mauviere

Affiliation:

icem7

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] Yes, I have

ericemc3 avatar Mar 04 '24 15:03 ericemc3

Hi @ericemc3, can you share the result of PRAGMA user_agent on the SQL side?

I am aware of a problem with threads, but on the browser I tested they were NOT enabled by default, and your setup is somewhat different.

IFF the result contains wasm_threads, I am aware of the problem and working on a fix. You can specify explicitly the bundle going to https://shell.duckdb.org/?bundle=eh

IFF the result is wasm_eh, this is a new problem, and I will later look at that.
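The two cases above can also be told apart in a script. This is only a minimal sketch, assuming the `PRAGMA user_agent` value has the shape `duckdb/<version>(<bundle>) <client>`; `bundleFromUserAgent` and `triage` are hypothetical helpers, not part of duckdb-wasm:

```typescript
// Hypothetical helper: extracts the bundle name from a PRAGMA user_agent value.
// Assumes the format "duckdb/<version>(<bundle>) <client>".
function bundleFromUserAgent(userAgent: string): string | null {
  const match = /\((wasm_[a-z0-9_]+)\)/.exec(userAgent);
  return match?.[1] ?? null;
}

// Maps the bundle name to the two cases described in the comment above.
function triage(userAgent: string): string {
  const bundle = bundleFromUserAgent(userAgent);
  if (bundle === "wasm_threads") return "known threads problem";
  if (bundle === "wasm_eh") return "new problem";
  return "unrecognized bundle";
}
```

The string to pass in is whatever `PRAGMA user_agent` returns in the shell.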

carlopi avatar Mar 04 '24 15:03 carlopi

(and thanks @szarnyasg, this looks like a duckdb/duckdb-wasm specific problem)

carlopi avatar Mar 04 '24 15:03 carlopi

Hi @carlopi, thank you for your prompt reaction. I get:

┌─────────────────────────────┐
│         user_agent          │
╞═════════════════════════════╡
│ duckdb/v0.10.0(wasm_eh) cpp │
└─────────────────────────────┘

and the same issue with https://shell.duckdb.org/?bundle=eh

Please note that it works fine with Firefox (with the same user_agent displayed). So it looks like a Chrome/Edge specific issue as well.

ericemc3 avatar Mar 04 '24 16:03 ericemc3

The same thing can be reproduced in DuckDB-WASM v1.28.0 (DuckDB v0.9.1). I thought this was an issue in my app and was still troubleshooting it as a low priority, so I hadn't reported it yet. I just saw this pop up, so I thought I'd share that we see this in v1.28.0 as well. I have worked around it for the moment by fetching the files outside of DuckDB, calling registerFileBuffer, and running read_parquet over the registered files instead of the URLs of the Parquet files.
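A rough sketch of that workaround, not the actual application code. It assumes `db` is a duckdb-wasm `AsyncDuckDB` instance (only its `registerFileBuffer` method is used here); `registerRemoteParquet`, the `download` callback, and the local file name are hypothetical names introduced for illustration:

```typescript
// Minimal slice of the duckdb-wasm API that this sketch relies on.
interface FileRegistry {
  registerFileBuffer(name: string, buffer: Uint8Array): Promise<void>;
}

// Hypothetical helper: downloads the whole file outside DuckDB, registers the
// bytes under a local name, and returns a SQL table expression that reads the
// registered file instead of the remote URL.
async function registerRemoteParquet(
  db: FileRegistry,
  download: (url: string) => Promise<Uint8Array>,
  url: string,
  localName: string,
): Promise<string> {
  const bytes = await download(url);
  await db.registerFileBuffer(localName, bytes);
  // Double any single quotes so the name is a valid SQL string literal.
  const quoted = localName.replace(/'/g, "''");
  return `read_parquet('${quoted}')`;
}
```

In a browser, `download` could be something like `async (u) => new Uint8Array(await (await fetch(u)).arrayBuffer())`. The trade-off is that the whole file is downloaded up front, so DuckDB no longer reads only the parts of the remote file it needs.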

dude0001 avatar Mar 04 '24 18:03 dude0001

I've been experiencing this same issue while using Evidence (@evidence-dev/evidence), which uses @duckdb/duckdb-wasm as a dependency. It seems to affect Windows only, and only Edge/Chrome. The page may load initially or error out, displaying Invalid error: TProtocolException: Invalid data. With caching turned off, or in Incognito/InPrivate mode, the query/page works correctly. We're querying Parquet files by URL, just like in this issue.

Affecting 1.28.0 (the Evidence dependency version) and the latest build 1.28.1-dev190.0.

timothyhoward avatar May 06 '24 05:05 timothyhoward

I'm running into this quite often with Evidence. Is there a workaround from the duckdb-wasm side?

fboerman avatar May 06 '24 17:05 fboerman

What is the value of the 'Content-Type' header for the parquet file being fetched ?

To find it, open the browser console, select the Network tab, have duckdb-wasm fetch the relevant resource, and click the matching row; the Headers tab then lists 'Content-Type'.
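The same header can be read from a script. This is a minimal sketch only: `contentTypeOf` is a hypothetical helper, and it sends a HEAD request, which a server may answer with different headers than the GET requests duckdb-wasm issues, so the Network tab remains the more faithful check:

```typescript
// Minimal shapes for the parts of the fetch API that this sketch uses.
interface HeadResponse {
  headers: { get(name: string): string | null };
}
type HeadFetcher = (url: string, init: { method: string }) => Promise<HeadResponse>;

// Hypothetical helper: returns the Content-Type reported for a URL, or null
// if the server does not send one.
async function contentTypeOf(url: string, doFetch: HeadFetcher): Promise<string | null> {
  const response = await doFetch(url, { method: "HEAD" });
  return response.headers.get("content-type");
}
```

In a browser console, the global `fetch` can be passed as `doFetch`.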

If that is 'text/plain', it might be connected to https://github.com/duckdb/duckdb-wasm/issues/1580, which stems from a problem in the spec and in how web browsers implement it.

Independently of the first question, a couple of other ones. Do you control the server? Can you share a URL that demonstrates this (ideally with a repro within shell.duckdb.org)?

carlopi avatar May 06 '24 18:05 carlopi

Hi @carlopi, the content-type is application/octet-stream when I start Evidence and it starts pulling in Parquet. I do not think it's related to that issue, given that that bug is Firefox-specific. This one is the other way around, being solely Chrome- and Edge-specific. I don't think I ever managed to get it in Firefox.

Yes, I control the server. You can go to reports.coreflowbased.eu and register a free account to test it out. That's my website, which from time to time throws this error. If you get a database timeout, simply refresh the site; that's a different bug.

fboerman avatar May 06 '24 18:05 fboerman

That bug is 2 bugs in one, one in Chrome and one in Firefox.

But that's not so relevant. I have no idea how to reproduce within the setting of your website, if you can give some instructions, I might be able to give it a try tomorrow.

carlopi avatar May 06 '24 18:05 carlopi

@carlopi Content-Type for the Parquet files is application/octet-stream here as well. I feel this is definitely a caching issue, as when the cache is disabled in Chrome/Edge, everything runs correctly.

Here's where the exception occurs (screenshot). I tried looking up the status code 42 but couldn't find a reference.

timothyhoward avatar May 07 '24 04:05 timothyhoward

Can also confirm, the application works perfectly in Firefox on the same system (Windows 10), so it's confined to Edge and Chrome on Windows. Seems to be the exact same issue as the one @ericemc3 has described in the DuckDB Shell.

Edit: I've mitigated the issue in my application by disabling caching through the 'Cache-Control' header for Parquet files. The error does not appear in Chrome/Edge when caching is turned off for Parquet files.
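A sketch of what such a server-side rule might look like. The exact header value used above is not stated, so `no-store` is an assumption, and `cacheHeadersFor` is a hypothetical helper rather than part of any particular server framework:

```typescript
// Hypothetical helper for a static file server: picks extra response headers
// based on the request path. "no-store" is an assumed value.
function cacheHeadersFor(path: string): Record<string, string> {
  // Match ".parquet" at the end of the path, or just before a query or hash.
  const isParquet = /\.parquet(\?|#|$)/i.test(path);
  return isParquet ? { "Cache-Control": "no-store" } : {};
}
```

The returned headers would be merged into the response for each request, so only Parquet files lose caching while other assets stay cacheable.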

timothyhoward avatar May 08 '24 00:05 timothyhoward

I managed to reproduce this on a Windows machine. I am very puzzled by this bug, given that it's non-deterministic. I found something strange that might be connected; the fix in #1733 will likely go in either way (since it makes sense), and we can then re-evaluate.

carlopi avatar May 08 '24 11:05 carlopi

At the moment I am unable to reproduce this (on shell.duckdb.org, currently at @duckdb/[email protected]), but hard to say whether this is properly fixed in all cases.

carlopi avatar May 10 '24 15:05 carlopi

At the moment I am unable to reproduce this (on shell.duckdb.org, currently at @duckdb/[email protected]), but hard to say whether this is properly fixed in all cases.

Hi @carlopi, Thank you for looking into the issue. Unfortunately, still occurring for me - the shell here (at 1.28.1-dev194.0) continues to produce the TProtocolException issue (as per @ericemc3's original reproduction).

timothyhoward avatar May 12 '24 23:05 timothyhoward

@carlopi thanks for looking into this.

I am also able to reproduce this on shell.duckdb.org @duckdb/[email protected], running chrome on windows.

This does not occur on chrome on mac.

  1. Clear browser data
  2. Visit shell.duckdb.org
  3. Successfully run the query FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' SELECT codgeo WHERE epci = '200039865' ;
  4. Refresh the page
  5. Run the query again - Invalid Error: TProtocolException: Invalid data
  6. Clear browser data
  7. Run the query again - success
Windows 11 10.0.22621
Chrome: 125.0.6422.77

mcrascal avatar May 22 '24 15:05 mcrascal

Hi all, we have developed a new duckdb-wasm version that allows explicitly setting whether to trust Content-Length information from HEAD requests.

I have a hard time reproducing this on my setup, it would be amazing if anyone could run the original issue in 2 additional modes:

SET reliable_head_requests = false;
FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' 
SELECT codgeo WHERE epci = '200039865' ;

and

SET reliable_head_requests = true;
FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' 
SELECT codgeo WHERE epci = '200039865' ;

Changing the reliable_head_requests setting slightly changes the logic that computes the length of a given resource, reordering the requests that are performed, from HEAD to GET requests.

This should move away from the behaviour that was problematic here.

I am experimenting with what behaviour should be set by default, input on whether this helps with this particular problem would be handy.
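For anyone testing from application code rather than the shell, the two modes could be exercised in a loop. This is only a sketch: `conn` is assumed to be a duckdb-wasm connection exposing `query(sql)`, `tryBothHeadModes` is a hypothetical helper, and since the error reported in this issue shows up after a page reload, it would have to be run once before and once after reloading:

```typescript
// Minimal slice of a duckdb-wasm connection that this sketch relies on.
interface QueryRunner {
  query(sql: string): Promise<unknown>;
}

// Hypothetical helper: runs the same query under both settings and records
// either "ok" or the error message for each mode.
async function tryBothHeadModes(conn: QueryRunner, sql: string): Promise<Record<string, string>> {
  const outcome: Record<string, string> = {};
  for (const mode of ["false", "true"]) {
    try {
      await conn.query(`SET reliable_head_requests = ${mode};`);
      await conn.query(sql);
      outcome[mode] = "ok";
    } catch (error) {
      outcome[mode] = error instanceof Error ? error.message : String(error);
    }
  }
  return outcome;
}
```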

carlopi avatar Jun 10 '24 11:06 carlopi

Hi, on Chrome and Edge on Windows:
Database: v1.0.0
Package: @duckdb/[email protected]

Still the same issue with:

SET reliable_head_requests = true;
FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' 
SELECT codgeo WHERE epci = '200039865' ;

Executed twice: OK. Then I reload the page, paste the same query again, and get: Invalid Error: TProtocolException: Invalid data

Is there another wasm version to test?

ericemc3 avatar Jun 10 '24 14:06 ericemc3

@ericemc3: can you possibly try?

SET reliable_head_requests = false;
FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet' 
SELECT codgeo WHERE epci = '200039865' ;

Link that could possibly work: https://shell.duckdb.org/#queries=v0,SET-reliable_head_requests-%3D-false~,FROM-'https%3A%2F%2Fstatic.data.gouv.fr%2Fresources%2Fcommunes%202023%20format%20parquet%2F20240122%20085355%2Fcommunes2023.parquet'-%0ASELECT-codgeo-WHERE-epci-%3D-'200039865'-~

Link with old behaviour: https://shell.duckdb.org/#queries=v0,SET-reliable_head_requests-%3D-true~,FROM-'https%3A%2F%2Fstatic.data.gouv.fr%2Fresources%2Fcommunes%202023%20format%20parquet%2F20240122%20085355%2Fcommunes2023.parquet'-%0ASELECT-codgeo-WHERE-epci-%3D-'200039865'-~

carlopi avatar Jun 10 '24 14:06 carlopi

Yes, I tried both, with the same outcome: Invalid Error: TProtocolException: Invalid data

ericemc3 avatar Jun 10 '24 14:06 ericemc3

I've been investigating an issue with ObservableHQ that relies on DuckDB that seems to have the same behaviour as listed in this issue. It does appear to be caching related. Is there any further information as to what might be causing this issue?

https://github.com/observablehq/framework/issues/1470

massyn avatar Jul 03 '24 10:07 massyn

Hi, same issue, +1 !

GuillaumeChretien avatar Jul 31 '24 20:07 GuillaumeChretien

I've posted a workaround over on the related Observable Framework issue already, but maybe it will help someone here as well until the issue is fixed. Obviously there are downsides, but I have been able to work around the issue by cache busting any Parquet link when a Windows device is detected. See example here.
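For readers without access to the linked example, the idea can be sketched roughly as follows. This is not the code from the linked example; `cacheBustOnWindows`, the `v` query parameter, and the user-agent test are all assumptions made for illustration:

```typescript
// Hypothetical helper: appends a changing query parameter to Parquet URLs when
// the user agent looks like Windows, so the browser cache is bypassed.
function cacheBustOnWindows(url: string, userAgent: string, token: string): string {
  const isWindows = /Windows/i.test(userAgent);
  const isParquet = /\.parquet(\?|#|$)/i.test(url);
  if (!isWindows || !isParquet) return url;

  // Keep any #fragment at the end of the URL.
  const hashIndex = url.indexOf("#");
  const base = hashIndex === -1 ? url : url.slice(0, hashIndex);
  const hash = hashIndex === -1 ? "" : url.slice(hashIndex);
  const separator = base.indexOf("?") === -1 ? "?" : "&";
  return `${base}${separator}v=${encodeURIComponent(token)}${hash}`;
}
```

Passing something like `navigator.userAgent` and `Date.now().toString()` gives a fresh URL on every load, which means Windows users download the file again each time.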

bjyberg avatar Aug 02 '24 12:08 bjyberg

@bjyberg many thanks, it works very well ! hope future version will fix this...

GuillaumeChretienCerema avatar Aug 16 '24 16:08 GuillaumeChretienCerema