[Bug]: jupysql + DuckDB -- reading json lines fails to parallelise across threads
What happened?
I'm using DuckDB in JupyterLab with jupysql (0.7.4) and running into an odd performance issue: I'm not seeing the parallel loading of JSON lines files that I get when I use DuckDB via the Python API directly. (I've just been eyeballing htop to see whether all my threads light up or only one is used.)
Note that `%config SqlMagic.autopandas = False` was set for the following comparisons.
```sql
%%sql tmp <<
-- this does *not* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl')
```
```python
# whereas running in Python *does* parallelise
# (doing a fetchall to match jupysql automatically pulling the complete set)
import duckdb

results = duckdb.sql("select 1 from read_ndjson_auto('jsonl_data/*.jsonl')").fetchall()
```
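As an aside (my addition while debugging, not something from the original comparison): it may be worth confirming that both paths see the same thread budget, since DuckDB's `threads` setting caps the parallelism of any scan:

```sql
-- check the current thread budget, then raise it if needed
SELECT current_setting('threads');
SET threads TO 8;
```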
Weirdly, I worked out that the following variations on the above query do in fact allow parallelisation:
- adding a `limit` clause
- `count`ing the rows instead
- `partition`ing over the results
```sql
%%sql tmp <<
-- this *does* parallelise
select 1 from read_ndjson_auto('jsonl_data/*.jsonl') limit 100000000
```
```sql
%%sql tmp <<
-- this *does* parallelise
select count() from read_ndjson_auto('jsonl_data/*.jsonl')
```
```sql
%%sql tmp <<
-- this *does* parallelise
select row_number() OVER (PARTITION BY id) as seqnum from read_ndjson_auto('jsonl_data/*.jsonl')
```
Here's a Python function to generate some dummy JSON lines data for debugging (with these defaults, it writes about 7 GB to disk):
```python
import json
from pathlib import Path
from random import randrange

def make_dummy_jsonl(path, num_files=200, length=1_000_000):
    target_path = Path(path)
    target_path.mkdir(parents=True, exist_ok=True)
    for file_num in range(1, num_files + 1):
        data = [
            {
                "id": row_num,
                "val1": randrange(100),
                "val2": randrange(100),
                "val3": randrange(100),
            }
            for row_num in range(1, length + 1)
        ]
        with open(target_path / f"dummy_{file_num:03}.jsonl", "w") as file:
            file.write("\n".join(json.dumps(row) for row in data))

make_dummy_jsonl("jsonl_data")
```
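As a quick sanity check on the generated data (a hedged aside; `count_jsonl_rows` is a hypothetical helper of mine, not part of jupysql or DuckDB), each file should parse as exactly one JSON object per line:

```python
# Hypothetical helper: confirm a generated file really is one JSON object per
# line, since the ndjson reader consumes newline-delimited JSON objects.
import json

def count_jsonl_rows(path):
    """Return the number of JSON objects in a JSON-lines file."""
    rows = 0
    with open(path) as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on a bad line
                rows += 1
    return rows
```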
While putting this together I realised that there is actually no parallelisation in DuckDB 0.7.1, and I needed a 0.7.2 pre-release to get any parallelisation, even in the Python example. So it may be a little premature to be debugging this.
Also, not sure if this is the right place for this issue, but it's a starting point and can move if needed.
DuckDB Engine Version
0.7.0
DuckDB Version
0.7.2.dev2699
SQLAlchemy Version
2.0.11
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I just took magic_duckdb for a spin (a DuckDB-specific SQL magic tool for Jupyter) and found that parallelisation works with it. Furthermore, magic_duckdb seems to be generally faster than JupySQL, offering speeds comparable to using DuckDB directly. I realised during all this testing that queries in JupySQL suffer a noticeable performance hit.