Crash while scanning PyArrow Dataset

Open · mattaubury opened this issue 1 year ago · 6 comments

What happens?

When trying to scan a large number of Feather files via pyarrow.dataset I get a segfault in duckdb::StringHeap::AddBlob:

#0  0x00007ffff6f27e85 in __memmove_avx_unaligned_erms () from /usr/lib64/libc.so.6
#1  0x00007fffd6401c26 in duckdb::StringHeap::AddBlob(char const*, unsigned long) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#2  0x00007fffd644b997 in void duckdb::ColumnDataCopy<duckdb::string_t>(duckdb::ColumnDataMetaData&, duckdb::UnifiedVectorFormat const&, duckdb::Vector&, unsigned long, unsigned long)
    () from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#3  0x00007fffd644df4a in duckdb::ColumnDataCollection::Append(duckdb::ColumnDataAppendState&, duckdb::DataChunk&) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#4  0x00007fffd697d415 in duckdb::PhysicalBatchCollector::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#5  0x00007fffd6d7ef6a in duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, unsigned long) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#6  0x00007fffd6d834e5 in duckdb::PipelineExecutor::Execute(unsigned long) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#7  0x00007fffd6d88529 in duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#8  0x00007fffd6d794f3 in duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#9  0x00007fffd6d7ae85 in duckdb::TaskScheduler::ExecuteForever(std::atomic<bool>*) ()
   from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#10 0x00007fffd70b29e0 in execute_native_thread_routine () from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#11 0x00007ffff7bb71cf in start_thread () from /usr/lib64/libpthread.so.0
#12 0x00007ffff6e91dd3 in clone () from /usr/lib64/libc.so.6

To Reproduce

The code I'm executing is as follows:

import duckdb
import pyarrow.dataset as ds

files = [...]  # list of Feather file paths (omitted)
duckdb.arrow(ds.dataset(files, format="feather")).arrow()

Unfortunately I'm unable to provide the files which trigger the crash :-(

I can confirm that all files have the same schema, and going direct to a pa.Table does not crash.
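
By "going direct" I mean roughly this (same files list as above), which completes fine:

table = ds.dataset(files, format="feather").to_table()  # reads straight into a pa.Table, no crash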

OS:

x64

DuckDB Version:

0.9.2

DuckDB Client:

Python

Full Name:

Matt Aubury

Affiliation:

.

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] Yes, I have

mattaubury · Feb 12 '24 17:02

That sounds like a data ownership issue, a heap-use-after-free if I had to guess

I'd be curious what data you're consuming

I also don't think the result collection method is required to reproduce this; can you try .fetchall() or .fetchmany()?
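
Something along these lines (a sketch, reusing your files placeholder):

rel = duckdb.arrow(ds.dataset(files, format="feather"))
rel.fetchall()          # materialize everything as Python tuples
# or, incrementally:
rel.fetchmany(100_000)  # fetch just a batch of rows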

Tishj · Feb 12 '24 17:02

It crashes with .fetchall(), and with .fetchmany() if I pass a large enough number.

The data is not very unusual: 5.6 million rows of int64, string, float64, and date32 columns, spread over 278 files.

My guess was that duckdb was interning the strings but somehow overflowed the reserved space?

mattaubury · Feb 12 '24 17:02

what do you mean by interning?

The part where it crashes is constructing the materialized result, which performs a copy

But until then it assumes it has (read-only) ownership of the Arrow data

Tishj · Feb 12 '24 18:02

Internal dictionary encoding (see https://en.wikipedia.org/wiki/String_interning), but that was just a hypothesis; I don't know the DuckDB internals at all.
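
For example, in pyarrow the same idea looks like this (values made up for illustration):

import pyarrow as pa

arr = pa.array(["FOOO", "BARR", "FOOO"]).dictionary_encode()
print(arr.dictionary)  # ["FOOO", "BARR"] -- each distinct string is stored once
print(arr.indices)     # [0, 1, 0]        -- rows just reference the dictionary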

mattaubury · Feb 12 '24 18:02

@mattaubury That sounds like our dictionary vector type, but I don't think we produce that from an Arrow scan

Tishj · Feb 12 '24 18:02

Okay, I found a specific column that seemed to be causing issues, and managed to create a reproducer (at least on my machine...):

import duckdb
import pyarrow.feather  # left over from the original Feather-based version (see edit below)
import pyarrow.dataset as ds
import pyarrow as pa
import random

# Build 100 in-memory tables, each with a single dictionary-encoded
# string column named "x".
tables = []
for i in range(100):
    array = pa.array(
        random.choice(["ABCD", "FOOO", "BARR"]) * 17000,
        type=pa.dictionary(pa.int16(), pa.string()),
    )
    tables.append(pa.table([array], names=["x"]))

# Scan the tables as a pyarrow dataset and materialize the result.
duckdb.arrow(ds.dataset(tables)).fetchall()

This crashes with a segfault as before.

[edit: skipped creating Feather files; scanning the tables in-memory shows the same issue]
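
A hypothetical workaround (untested) might be to decode the dictionary columns to plain strings before DuckDB sees them, so the Arrow scan never encounters dictionary-encoded data:

# Untested sketch: cast the dictionary column "x" to plain strings first.
decoded = [pa.table({"x": t["x"].cast(pa.string())}) for t in tables]
duckdb.arrow(ds.dataset(decoded)).fetchall()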

mattaubury · Feb 12 '24 18:02