python core dump when downloading dataset
### Describe the bug
When downloading a dataset in streaming mode and exiting the program before the download completes, the Python program core dumps on exit:

```
terminate called without an active exception
Aborted (core dumped)
```
Tested with Python 3.12.3 and Python 3.9.21.
### Steps to reproduce the bug
Create a Python venv and install `datasets`:

```bash
python -m venv venv
source venv/bin/activate
pip install datasets==4.4.1
```
Execute the following program:
```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)
for sample in ds:
    break
```
### Expected behavior

A clean program exit.
### Environment info

Described above: `datasets==4.4.1` on Python 3.12.3 and Python 3.9.21.
Note: the example works correctly when using `datasets==3.1.0`.
Hi @hansewetz, I'm curious: for me it works just fine. Are you still observing the issue?
Yup ... still the same issue.
However, after accidentally adding a sleep(1) call after the for loop during debugging, the program terminates properly (not a good solution, though ... :-)).
Are there threads created to handle the download that are still running when the program exits?
I haven't had time yet to go through the code in iterable_dataset.py::IterableDataset.
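For reference, a minimal sketch of the accidental workaround described above; the only change to the reproduction script is the sleep(1) at the end:

```python
import time

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)
for sample in ds:
    break

# Giving the background download/reader threads a moment before interpreter
# shutdown avoids the abort. This is a workaround, not a fix.
time.sleep(1)
```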
Interesting, I was able to reproduce it. In a Jupyter notebook the code runs just fine, but as a Python script it indeed seems to never finish running (which is probably what leads to the core-dump error). I'll try to take a look at the source code as well to see if I can figure it out.
Hi @hansewetz, if possible can I be assigned this issue?
> If possible can I be assigned this issue?

Hi, I don't know how assignments work here or who makes decisions about assignments ...
Hi @hansewetz and @Aymuos22, I have made some progress:
- Confirmed the last working version is 3.1.0.
- Between 3.1.0 and 3.2.0, there was a change in how parquet files are read (see here).
The issue seems to be the following code:
```python
parquet_fragment.to_batches(
    batch_size=batch_size,
    columns=self.config.columns,
    filter=filter_expr,
    batch_readahead=0,
    fragment_readahead=0,
)
```
Adding a `use_threads=False` parameter to the `to_batches` call solves the bug. However, this seems far from an optimal solution, since we'd like to be able to use multiple threads for reading the fragments.
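For illustration, that is the same call as above with only the extra flag (a sketch of the change just described, not a proposed patch):

```python
parquet_fragment.to_batches(
    batch_size=batch_size,
    columns=self.config.columns,
    filter=filter_expr,
    batch_readahead=0,
    fragment_readahead=0,
    use_threads=False,  # forces single-threaded reads; lets the program exit cleanly
)
```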
I'll keep investigating to see if there's a better solution.
Hi @lhoestq, may I ask whether the current behaviour is expected on your side and you don't think it needs solving, or should I keep investigating a compromise between using multithreading and avoiding the unexpected behaviour? Thanks in advance :)
Having the same issue: the code never stops executing. Using datasets 4.4.1. Tried with `islice` as well; when the streaming flag is True, the code doesn't end execution. On VS Code.
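(For context, the islice variant mentioned above would presumably look something like the following; the slice length of 5 is arbitrary and just for illustration:)

```python
from itertools import islice

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)

# Consume a bounded number of samples instead of break-ing out of the loop;
# with streaming=True the process reportedly still fails to exit.
for sample in islice(ds, 5):
    pass
```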
The issue on the pyarrow side is here: https://github.com/apache/arrow/issues/45214 and the original issue in datasets is here: https://github.com/huggingface/datasets/issues/7357
It would be cool to have a fix on the pyarrow side.
Thank you very much @lhoestq, I'm reading the issue thread in pyarrow and realizing you've been raising awareness around this for a long time now. When I have some time I'll look at @pitrou's PR to see if I can get a better understanding of what's going on in pyarrow.