Crash while scanning PyArrow Dataset
What happens?
When trying to scan a large number of Feather files via pyarrow.dataset, I get a segfault in duckdb::StringHeap::AddBlob:
#0 0x00007ffff6f27e85 in __memmove_avx_unaligned_erms () from /usr/lib64/libc.so.6
#1 0x00007fffd6401c26 in duckdb::StringHeap::AddBlob(char const*, unsigned long) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#2 0x00007fffd644b997 in void duckdb::ColumnDataCopy<duckdb::string_t>(duckdb::ColumnDataMetaData&, duckdb::UnifiedVectorFormat const&, duckdb::Vector&, unsigned long, unsigned long)
() from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#3 0x00007fffd644df4a in duckdb::ColumnDataCollection::Append(duckdb::ColumnDataAppendState&, duckdb::DataChunk&) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#4 0x00007fffd697d415 in duckdb::PhysicalBatchCollector::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#5 0x00007fffd6d7ef6a in duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, unsigned long) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#6 0x00007fffd6d834e5 in duckdb::PipelineExecutor::Execute(unsigned long) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#7 0x00007fffd6d88529 in duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#8 0x00007fffd6d794f3 in duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#9 0x00007fffd6d7ae85 in duckdb::TaskScheduler::ExecuteForever(std::atomic<bool>*) ()
from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#10 0x00007fffd70b29e0 in execute_native_thread_routine () from /home/maubury/.conda/envs/juggernaut/lib/python3.11/site-packages/duckdb/duckdb.cpython-311-x86_64-linux-gnu.so
#11 0x00007ffff7bb71cf in start_thread () from /usr/lib64/libpthread.so.0
#12 0x00007ffff6e91dd3 in clone () from /usr/lib64/libc.so.6
To Reproduce
The code I'm executing is as follows:
import duckdb
import pyarrow.dataset as ds
files = [...]
duckdb.arrow(ds.dataset(files, format="feather")).arrow()
Unfortunately I'm unable to provide the files which trigger the crash :-(
I can confirm that all files have the same schema, and reading them directly into a pa.Table does not crash.
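For reference, the direct PyArrow read (the path that does not crash) looks roughly like this; a sketch reusing the files placeholder from the snippet above:

import pyarrow.dataset as ds

# Materialize a pa.Table straight from the Feather files, without DuckDB involved.
table = ds.dataset(files, format="feather").to_table()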
OS:
x64
DuckDB Version:
0.9.2
DuckDB Client:
Python
Full Name:
Matt Aubury
Affiliation:
.
Have you tried this on the latest nightly build?
I have tested with a nightly build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
That sounds like a data ownership issue; heap-use-after-free if I had to guess.
I'd be curious what data you're consuming.
I also don't think the result collection method is required to reproduce this. Can you try .fetchall() or .fetchmany()?
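Concretely, the alternative collection calls would look something like this (a sketch reusing the files placeholder from the report; the batch size passed to fetchmany is just illustrative):

rel = duckdb.arrow(ds.dataset(files, format="feather"))
rows = rel.fetchall()           # materialize everything as Python tuples
batch = rel.fetchmany(100_000)  # or pull a bounded number of rows at a time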
It crashes with .fetchall(), and with .fetchmany() if I pass a large enough number.
The data is not very unusual: it's 5.6 million rows of int64, string, float64, and date32 columns, spread over 278 files.
My guess was that duckdb was interning the strings but somehow overflowed the reserved space?
What do you mean by interning?
The part it crashes in is constructing the materialized result, which performs a copy.
But until then it assumes it has (read-only) ownership of the Arrow data.
Internal dictionary encoding (see https://en.wikipedia.org/wiki/String_interning), but that was just a hypothesis, I don't know duckdb internals at all.
@mattaubury that sounds like our dictionary vector type, but I don't think we produce that from an Arrow scan.
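For reference, a dictionary-encoded Arrow array stores small integer codes plus a dictionary of the distinct values; a minimal sketch (not from the thread):

import pyarrow as pa

arr = pa.array(["ABCD", "FOOO", "ABCD"], type=pa.dictionary(pa.int16(), pa.string()))
print(arr.indices)     # int16 codes, e.g. [0, 1, 0]
print(arr.dictionary)  # distinct values, e.g. ["ABCD", "FOOO"]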
Okay, I found a specific column that seemed to be causing issues, and managed to create a reproducer (at least on my machine...):
import duckdb
import pyarrow.feather
import pyarrow.dataset as ds
import pyarrow as pa
import random
tables = []
for i in range(100):
    array = pa.array(
        random.choice(["ABCD", "FOOO", "BARR"]) * 17000,
        type=pa.dictionary(pa.int16(), pa.string()),
    )
    tables.append(pa.table([array], names=["x"]))
duckdb.arrow(ds.dataset(tables)).fetchall()
This crashes with a segfault as before.
[edit: no need to create Feather files; scanning the tables in-memory shows the same issue]
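A possible workaround sketch while this is investigated (an assumption on my part, not verified against the original 278-file dataset): decode the dictionary-encoded column to plain strings before handing the tables to DuckDB.

plain_tables = [
    t.set_column(0, "x", t.column("x").cast(pa.string()))  # dictionary<int16, string> -> plain string
    for t in tables
]
duckdb.arrow(ds.dataset(plain_tables)).fetchall()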