tsfresh
BrokenPipeError
The problem:
I'm trying to run extract_features on my data but keep getting a BrokenPipeError. I tried it on two different computers (both with the same environment) and get the same error. The dataset is quite large (merged DataFrame shape: (880169, 522)), so the extraction is expected to take around 20 hours. It runs for a few hours and then crashes.
Settings:
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters
from tsfresh.utilities.dataframe_functions import impute

features = extract_features(
    merged_time_series,
    column_id="id",
    default_fc_parameters=ComprehensiveFCParameters(),
    n_jobs=15,
    impute_function=impute,
)
Error (repeated many times):
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Anything else we need to know?:
I also tried running it with a smaller chunksize and fewer jobs, but with no change.
features = extract_features(
    merged_time_series,
    column_id="id",
    default_fc_parameters=ComprehensiveFCParameters(),
    n_jobs=8,
    impute_function=impute,
    chunksize=1,
)
Environment:
- Python version: 3.10
- Operating System: Ubuntu 22.04
- tsfresh version: 0.19.0
- Install method (conda, pip, source): pip
Hi @johan-sightic! Thanks for filing the issue and sorry for the long delay! A "Broken pipe" error typically means that the worker processes were killed by the OS, most likely because the machine ran out of memory. If your data consists of multiple IDs, I would recommend producing the features in chunks of identifiers. If your data consists of just a single ID, you might either want to use a machine with more memory or produce features for windows of the data (these differ from the features computed on the full series, but maybe your use case allows for this).
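For illustration, here is a minimal sketch of the "chunks of identifiers" approach, assuming the same merged_time_series DataFrame and "id" column from the report above; the helper name, batch count, and job count are hypothetical and should be tuned to the available memory:

import numpy as np
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters
from tsfresh.utilities.dataframe_functions import impute

def extract_in_id_batches(df, n_batches=20, n_jobs=8):
    # Split the unique identifiers into batches and extract features batch by
    # batch, so only a fraction of the data is held in memory at any one time.
    all_ids = df["id"].unique()
    results = []
    for id_batch in np.array_split(all_ids, n_batches):
        subset = df[df["id"].isin(id_batch)]
        feats = extract_features(
            subset,
            column_id="id",
            default_fc_parameters=ComprehensiveFCParameters(),
            n_jobs=n_jobs,
            impute_function=impute,
        )
        results.append(feats)
    # Per-ID feature rows are independent, so concatenating the batch results
    # gives the same feature matrix as a single call over the full DataFrame.
    return pd.concat(results)

features = extract_in_id_batches(merged_time_series)

Peak memory then scales with the batch size rather than the full 880169-row DataFrame. Should you go the windowing route instead, tsfresh's roll_time_series (in tsfresh.utilities.dataframe_functions) can produce the windowed sub-series to feed into extract_features.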