python core dump when downloading dataset
### Describe the bug
When downloading a dataset in streaming mode and exiting the program before the download completes, the Python program core dumps on exit:

```
terminate called without an active exception
Aborted (core dumped)
```
Tested with Python 3.12.3 and Python 3.9.21.
### Steps to reproduce the bug
Create a Python venv and install `datasets`:

```bash
python -m venv venv
source venv/bin/activate
pip install datasets==4.4.1
```
Execute the following program:
```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)
for sample in ds:
    break
```
### Expected behavior

A clean program exit.
### Environment info

Described above: `datasets==4.4.1` on Python 3.12.3 and Python 3.9.21.
Note: the example works correctly when using `datasets==3.1.0`.
Hi @hansewetz, I'm curious: for me it works just fine. Are you still observing the issue?
Yup ... still the same issue.
However, after accidentally adding a sleep(1) call after the for loop during debugging, the program terminates properly (not a good solution, though ... :-)).
Are there threads created to handle the download that are still running when the program exits?
I haven't had time yet to go through the code in iterable_dataset.py::IterableDataset.
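For reference, a minimal sketch of the accidental workaround described above; the only change to the reproduction script is the sleep(1) at the end:

```python
import time

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)
for sample in ds:
    break

# Giving the background download/reader threads a moment before interpreter
# shutdown avoids the abort. This is a workaround, not a fix.
time.sleep(1)
```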
Interesting, I was able to reproduce it. In a Jupyter notebook the code runs just fine, but as a Python script it indeed seems to never finish running (which is probably what leads to the core-dump error). I'll try to take a look at the source code as well to see if I can figure it out.
Hi @hansewetz, if possible can I be assigned this issue?
> If possible can I be assigned this issue?

Hi, I don't know how assignments work here or who makes decisions about assignments ...
Hi @hansewetz and @Aymuos22, I have made some progress:
- Confirmed the last working version is 3.1.0.
- Between 3.1.0 and 3.2.0, there was a change in how parquet files are read (see here).
The issue seems to be the following code:
```python
parquet_fragment.to_batches(
    batch_size=batch_size,
    columns=self.config.columns,
    filter=filter_expr,
    batch_readahead=0,
    fragment_readahead=0,
)
```
Adding a `use_threads=False` parameter to the `to_batches` call solves the bug. However, this seems far from an optimal solution, since we'd like to be able to use multiple threads for reading the fragments.
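For illustration, that is the same call as above with only the extra flag (a sketch of the change just described, not a proposed patch):

```python
parquet_fragment.to_batches(
    batch_size=batch_size,
    columns=self.config.columns,
    filter=filter_expr,
    batch_readahead=0,
    fragment_readahead=0,
    use_threads=False,  # forces single-threaded reads; lets the program exit cleanly
)
```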
I'll keep investigating to see if there's a better solution.
Hi @lhoestq, may I ask whether the current behaviour is expected on your side and you don't think it needs solving, or should I keep investigating a compromise between using multithreading and avoiding the unexpected behaviour? Thanks in advance :)
Having the same issue: the code never stops executing. Using datasets 4.4.1. Tried with `islice` as well; when the streaming flag is True, the code doesn't end execution. On VS Code.
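(For context, the islice variant mentioned above would presumably look something like the following; the slice length of 5 is arbitrary and just for illustration:)

```python
from itertools import islice

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-2", "hrv_Latn", split="test", streaming=True)

# Consume a bounded number of samples instead of break-ing out of the loop;
# with streaming=True the process reportedly still fails to exit.
for sample in islice(ds, 5):
    pass
```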
The issue on the pyarrow side is here: https://github.com/apache/arrow/issues/45214 and the original issue in datasets is here: https://github.com/huggingface/datasets/issues/7357
It would be cool to have a fix on the pyarrow side.
Thank you very much @lhoestq, I'm reading the issue thread in pyarrow and realizing you've been raising awareness around this for a long time now. When I have some time I'll look at @pitrou's PR to see if I can get a better understanding of what's going on in pyarrow.