pandarallel
parallel_apply results in EOFError when run from Pycharm, works fine from Jupyter Notebook
I was trying to parallelise my code with the pandarallel package in the following way:
import pandas as pd
from sklearn.cluster import SpectralClustering
from pandarallel import pandarallel
import numpy as np
ex = {'measurement_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1}, 'time': {0: 30000, 1: 30000, 2: 30000, 3: 30000, 4: 30000, 5: 30000, 6: 30000, 7: 30000, 8: 30000, 9: 30000, 10: 30100, 11: 30100, 12: 30100, 13: 30100, 14: 30100, 15: 30100, 16: 30100, 17: 30100, 18: 30100, 19: 30100}, 'group': {0: '0', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0'}, 'object': {0: 'obj1', 1: 'obj10', 2: 'obj2', 3: 'obj3', 4: 'obj4', 5: 'obj5', 6: 'obj6', 7: 'obj7', 8: 'obj8', 9: 'obj9', 10: 'obj1', 11: 'obj10', 12: 'obj2', 13: 'obj3', 14: 'obj4', 15: 'obj5', 16: 'obj6', 17: 'obj7', 18: 'obj8', 19: 'obj9'}, 'x': {0: 55.507999420166016, 1: 49.67399978637695, 2: 61.9640007019043, 3: 67.98300170898438, 4: 49.43199920654297, 5: 40.34000015258789, 6: 69.50399780273438, 7: 49.65800094604492, 8: 68.48200225830078, 9: 37.87900161743164, 10: 55.595001220703125, 11: 49.52399826049805, 12: 61.92499923706055, 13: 67.91799926757812, 14: 49.30099868774414, 15: 40.141998291015625, 16: 69.49299621582031, 17: 49.775001525878906, 18: 68.4010009765625, 19: 37.77899932861328}}
ex = pd.DataFrame.from_dict(ex).set_index(['measurement_id', 'time', 'group'])
def cluster(x, index):
    x = np.asarray(x)[:, np.newaxis]
    clustering = SpectralClustering(n_clusters = 3, random_state = 42, gamma = 1 / 50).fit(x)
    return pd.Series(clustering.labels_ + 1, index = index)
pandarallel.initialize(nb_workers=2, progress_bar=True)
ex \
.groupby(['measurement_id', 'time', 'group']) \
.parallel_apply(lambda x: cluster(x['x'], x['object']))
However, when I run this in PyCharm I get the following error:
File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-84-7c89aedcfad4>", line 13, in <module>
    .parallel_apply(lambda x: cluster(x['x'], x['object']))
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 451, in closure
    map_result,
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 358, in get_workers_result
    message_type, message = queue.get()
  File "<string>", line 2, in get
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
I thought this might be due to some incompatibility with the latest pandas or Python release, so I tried to reproduce the issue with a different environment in a Jupyter Notebook. It worked well, so I then tested the original environment in a Jupyter Notebook as well, and it also worked fine. I made sure that I was running the same environment with
import sys
print(sys.executable)
and this is indeed the case. So the only difference seems to be that I use PyCharm instead of Jupyter Notebook. My environment is set up with Python 3.7.6 and pandas 1.0.1.
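One general thing worth checking when multiprocessing-based code fails from a script or IDE but works in a notebook is whether the entry point is protected by an `if __name__ == "__main__":` guard; without it, worker processes can die early and the failure surfaces as an EOFError on the result queue. This is a minimal stdlib sketch of the guarded pattern (plain `multiprocessing`, not pandarallel's internals; the function names are illustrative):

```python
import multiprocessing as mp

def square(n):
    # Work executed in a worker process; it must be defined at module
    # top level so the workers can locate it by name.
    return n * n

def run_parallel(func, values):
    # The pool should only ever be created under the __main__ guard below;
    # otherwise each re-importing worker could try to start its own pool.
    with mp.Pool(processes=2) as pool:
        return pool.map(func, values)

if __name__ == "__main__":
    print(run_parallel(square, [1, 2, 3, 4]))
```

Whether this applies to pandarallel in PyCharm specifically depends on the run configuration, but it is a cheap thing to rule out.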
I have experienced the exact same issue.
Same for me in PyCharm. Haven't tried a different IDE yet. However, NOT using the memory file system, by setting use_memory_fs=False in the initialize call, seems to work.
Same issue here. Launching from the terminal works. use_memory_fs=False doesn't solve the problem.
Hi, I'm having the same problem, and use_memory_fs=False doesn't solve it.
I had a similar exception when running out of memory.
Same situation with pandarallel 1.5.2 and Python 3.6.8, although not in PyCharm but in JupyterLab. JupyterLab is running in a Docker container, but I explicitly increased the /dev/shm size. The use_memory_fs parameter does not affect the occurrence of the error.
The stacktrace matches the one in https://github.com/nalepae/pandarallel/issues/76#issue-564741232.
I tried various older versions of pandarallel with no success.
Same here. Works fine from the terminal, fails in PyCharm.
I have the same issue with Pycharm professional 2021.2.2, using use_memory_fs=False does not solve the problem.
Deactivating "Run with Python Console" in the run configuration solved the problem for me.
Same issue when running it as a Python file.
If such a problem occurs, the first step is to run the plain apply function and check whether the code works. If your code fails for any reason, pandarallel reports it as an EOFError.
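In other words, the sequential version surfaces the worker's real exception directly, while parallel_apply can mask it. A small sketch of that debugging step (the DataFrame and function here are illustrative, not the reporter's clustering code):

```python
import pandas as pd

def summarize(group):
    # Any per-group logic; a bug in here is reported directly by .apply(),
    # but may only show up as EOFError through .parallel_apply().
    return group["x"].mean()

df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 4.0]})

# Debug with the sequential version first:
result = df.groupby("g").apply(summarize)

# Only once this runs cleanly, switch to the parallel version:
#   pandarallel.initialize()
#   df.groupby("g").parallel_apply(summarize)
```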
I have the same issue; it's still not solved.
Reporting the same issue. My code works fine when I use apply().
For me, the same "Ran out of input" error occurs in VSCode when the code is run as a Python file. It works fine in a Jupyter Notebook.
Same here; neither the memory fs setting nor the Python console trick works. However, I receive a slightly different error:
File "/home/thomas/.local/lib/python3.8/site-packages/pandarallel/core.py", line 195, in
    get_dataframe_and_delete_file(output_file_path)
  File "/home/thomas/.local/lib/python3.8/site-packages/pandarallel/core.py", line 189, in get_dataframe_and_delete_file
    data = pickle.load(file_descriptor)
EOFError: Ran out of input
So it seems to me that the pickle file is not properly written before it is read.
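If the reader really can get ahead of the writer, the usual remedy on the writing side is to flush and fsync before the reader is told the file is ready. A stdlib-only sketch of that write pattern (this is the general technique, not pandarallel's actual code):

```python
import os
import pickle
import tempfile

def write_pickle_durably(obj, path):
    # Write, then flush Python's buffers and ask the OS to persist the
    # bytes before any other process is signalled that the file is ready.
    with open(path, "wb") as f:
        pickle.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())

def read_pickle(path):
    with open(path, "rb") as f:
        # On a truncated or empty file, pickle.load raises exactly
        # "EOFError: Ran out of input".
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "result.pkl")
write_pickle_durably({"answer": 42}, path)
data = read_pickle(path)
```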
Same issue in a Linux console when running a .py script.
I had the same error. I was using a function decorated with lru_cache and passing it to parallel_apply with axis=1; removing the lru_cache decorator solved the issue.
For me, the error was resolved once I fed a normal named function to parallel_apply instead of a lambda.
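A plausible explanation for this fix is that the standard pickle module cannot serialize lambdas, whereas named, importable functions are pickled by reference, so multiprocessing-based libraries often choke on lambdas; whether that is what happens inside pandarallel depends on how it serializes the callable. A quick stdlib demonstration of the difference:

```python
import pickle

# A named, importable function pickles fine (by reference):
payload = pickle.dumps(len)

# A lambda has no importable name, so standard pickle refuses it:
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```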