duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

Error in executing join statements with parquet/duckdb using duckdb driver

Open Apoorwak opened this issue 1 year ago • 3 comments
trafficstars

What happens?

Scenario: Error in executing join statements on 64 million and 15 million records data that is stored in a duckdb and parquet file/table. Statements executed on parquet and duckdb using duckdb driver. CPU and RAM: 12 CPU 40 GB RAM Duckdb version: 0.9.0

Failed test cases on duckdb/parquet:

Inner join on 3 tables containing 64 M records each (joined on one string and one decimal column) Full join on 3 tables containing 64 M records each (joined on two string column) - (The data does not cause cartesian output) Inner join on 5 tables containing 15 M records each (joined on one string and one decimal column) Observation: We notice that DuckDB is erroring out while joining more than two files/tables containing 64 M records each and more than four files/tables containing 15 M records each (as highlighted in the above test cases). In local machine (4 CPU with 32 GB RAM), the same operation is throwing "database error".

To Reproduce

Queries ran on 64 million records data that is stored in a duckdb and parquet file/table :

import duckdb
conn = duckdb.connect("C:/duckfile.duckdb")
join_query='''SELECT * from file1_64M AS R1 join file2_64M AS R2 on R1.column1 = R2.column2 join file3_64M AS R3 on R2.column1 = R3.column3'''
result = conn.query(join_query).fetchmany(200)
print(result)
conn.close()
import duckdb
join_query='''SELECT * from read_parquet("C:/file1_64M.parquet") AS R1 
full join read_parquet("C:/file2_64M.parquet") AS R2 on R1.column1 = R2.column2
full join read_parquet("C:/file3_64M.parquet") AS R3 on R2.column1 = R3.column2'''
result = duckdb.query(join_query).fetchmany(200)
print(result)

OS:

Windows, x64

DuckDB Version:

0.9.0

DuckDB Client:

Python

Full Name:

Apoorwa Kumari

Affiliation:

EY

Have you tried this on the latest nightly build?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] Yes, I have

Apoorwak avatar Feb 08 '24 09:02 Apoorwak

Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it.

Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:

```sql
SELECT ...
```

If this is a false positive, feel free to disregard this comment.

duckdblabs-bot avatar Feb 08 '24 09:02 duckdblabs-bot

Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it.

Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:

```sql
SELECT ...

If this is a false positive, feel free to disregard this comment.

I have formatted using code block.

Apoorwak avatar Feb 08 '24 10:02 Apoorwak

Hi @Apoorwak thanks for adjusting the formatting.

Could you please share a data set that reproduces the issue? It can be a mock version of the original data of course.

szarnyasg avatar Feb 08 '24 17:02 szarnyasg