[Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads
What happened?
I'm using DuckDB in JupyterLab with jupysql (0.7.4) and running into an odd performance issue: I'm not seeing the parallel loading of JSON lines files that I get when I use DuckDB via the Python API directly. (I've just been eyeballing htop to see whether all my threads light up or only one is used.)
Note that `%config SqlMagic.autopandas = False` was set for the following comparisons.
```sql
%%sql tmp <<
-- this does *not* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl')
```
```python
# whereas running in Python *does* parallelise
# (doing a fetchall to match jupysql automatically pulling the complete set)
import duckdb

results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()
```
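As an aside (my addition while debugging, not something from the original comparison): it may be worth confirming that both paths see the same thread budget, since DuckDB's `threads` setting caps the parallelism of any scan:

```sql
-- check the current thread budget, then raise it if needed
SELECT current_setting('threads');
SET threads TO 8;
```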
Weirdly, I worked out that the following variations on the above query do in fact allow parallelisation:
- adding a `limit` clause
- `count`ing the rows instead
- `partition`ing over the results
```sql
%%sql tmp <<
-- this *does* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl') limit 100000000
```
```sql
%%sql tmp <<
-- this *does* parallelise
select count() from read_ndjson_auto('jsonl_data/*.jsonl')
```
```sql
%%sql tmp <<
-- this *does* parallelise
select row_number() OVER (PARTITION BY id) as seqnum from read_ndjson_auto('jsonl_data/*.jsonl')
```
Here's a Python function to generate some dummy JSON lines data for debugging (with these defaults, it writes about 7 GB to disk):
```python
import json
from pathlib import Path
from random import randrange

def make_dummy_jsonl(path, num_files=200, length=1_000_000):
    target_path = Path(path)
    target_path.mkdir(parents=True, exist_ok=True)
    for file_num in range(1, num_files + 1):
        data = [
            {
                "id": row_num,
                "val1": randrange(100),
                "val2": randrange(100),
                "val3": randrange(100),
            }
            for row_num in range(1, length + 1)
        ]
        with open(target_path / f"dummy_{file_num:03}.jsonl", "w") as file:
            file.write("\n".join(json.dumps(row) for row in data))

make_dummy_jsonl("jsonl_data")
```
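As a quick sanity check on the generated data (a hedged aside; `count_jsonl_rows` is a hypothetical helper of mine, not part of jupysql or DuckDB), each file should parse as exactly one JSON object per line:

```python
# Hypothetical helper: confirm a generated file really is one JSON object per
# line, since the ndjson reader consumes newline-delimited JSON objects.
import json

def count_jsonl_rows(path):
    """Return the number of JSON objects in a JSON-lines file."""
    rows = 0
    with open(path) as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on a bad line
                rows += 1
    return rows
```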
While putting this together I realised that there is actually no parallelisation in DuckDB 0.7.1, and I needed a 0.7.2 pre-release to get any parallelisation, even in the Python example. So it may be a little premature to be debugging this.
Also, not sure if this is the right place for this issue, but it's a starting point and can move if needed.
DuckDB Engine Version
0.7.0
DuckDB Version
0.7.2.dev2699
SQLAlchemy Version
2.0.11
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I just took magic_duckdb for a spin (a DuckDB-specific SQL magic tool for Jupyter) and found that parallelisation works with it. Furthermore, magic_duckdb seems to be generally faster than JupySQL, offering speeds comparable to using DuckDB directly. I realised during all this testing that queries in JupySQL suffer a noticeable performance hit.