
parallel_apply results in EOFError when run from Pycharm, works fine from Jupyter Notebook

Open KubaMichalczyk opened this issue 5 years ago • 18 comments

I was trying to parallelise my code with the pandarallel package in the following way:

import pandas as pd
from sklearn.cluster import SpectralClustering
from pandarallel import pandarallel
import numpy as np
ex = {'measurement_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1}, 'time': {0: 30000, 1: 30000, 2: 30000, 3: 30000, 4: 30000, 5: 30000, 6: 30000, 7: 30000, 8: 30000, 9: 30000, 10: 30100, 11: 30100, 12: 30100, 13: 30100, 14: 30100, 15: 30100, 16: 30100, 17: 30100, 18: 30100, 19: 30100}, 'group': {0: '0', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0'}, 'object': {0: 'obj1', 1: 'obj10', 2: 'obj2', 3: 'obj3', 4: 'obj4', 5: 'obj5', 6: 'obj6', 7: 'obj7', 8: 'obj8', 9: 'obj9', 10: 'obj1', 11: 'obj10', 12: 'obj2', 13: 'obj3', 14: 'obj4', 15: 'obj5', 16: 'obj6', 17: 'obj7', 18: 'obj8', 19: 'obj9'}, 'x': {0: 55.507999420166016, 1: 49.67399978637695, 2: 61.9640007019043, 3: 67.98300170898438, 4: 49.43199920654297, 5: 40.34000015258789, 6: 69.50399780273438, 7: 49.65800094604492, 8: 68.48200225830078, 9: 37.87900161743164, 10: 55.595001220703125, 11: 49.52399826049805, 12: 61.92499923706055, 13: 67.91799926757812, 14: 49.30099868774414, 15: 40.141998291015625, 16: 69.49299621582031, 17: 49.775001525878906, 18: 68.4010009765625, 19: 37.77899932861328}}

ex = pd.DataFrame.from_dict(ex).set_index(['measurement_id', 'time', 'group'])
    
def cluster(x, index):
    x = np.asarray(x)[:, np.newaxis]
    
    clustering = SpectralClustering(n_clusters = 3, random_state = 42, gamma = 1 / 50).fit(x)
    return pd.Series(clustering.labels_ + 1, index = index)
    
pandarallel.initialize(nb_workers=2, progress_bar=True)
ex \
    .groupby(['measurement_id', 'time', 'group']) \
    .parallel_apply(lambda x: cluster(x['x'], x['object']))

However, when I run this in PyCharm, I get the following error:

  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-84-7c89aedcfad4>", line 13, in <module>
    .parallel_apply(lambda x: cluster(x['x'], x['object']))
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 451, in closure
    map_result,
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 358, in get_workers_result
    message_type, message = queue.get()
  File "<string>", line 2, in get
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/kuba/anaconda3/envs/test_env/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

I thought this might be due to some incompatibility with the latest pandas or Python release, so I tried to reproduce the issue in a different environment in Jupyter Notebook. It worked fine there, so I then tested the original environment in Jupyter Notebook as well, and it also worked. I made sure that I was running the same environment with

import sys
print(sys.executable)

and that is indeed the case. So the only difference seems to be that I use PyCharm instead of Jupyter Notebook. My environment is set up with Python 3.7.6 and pandas 1.0.1.

KubaMichalczyk avatar Feb 13 '20 14:02 KubaMichalczyk

I have experienced the same exact issue.

platypus1989 avatar Mar 11 '20 18:03 platypus1989

Same for me in PyCharm. I haven't tried a different IDE yet. However, NOT using the memory file system, by setting use_memory_fs=False in the initialize call, seems to work.
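For reference, a minimal sketch of that workaround, mirroring the initialize call from the original report (disabling the memory file system makes pandarallel exchange data through regular temporary files instead of /dev/shm):

```python
from pandarallel import pandarallel

# Same initialize call as in the report, but with the memory file
# system disabled; worker input/output then goes through ordinary
# temporary files rather than shared memory.
pandarallel.initialize(nb_workers=2, progress_bar=True, use_memory_fs=False)
```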

moritzwilksch avatar Jun 24 '20 13:06 moritzwilksch

Same issue here. Launching from the terminal works. use_memory_fs=False doesn't solve the problem.

nbrosse avatar Sep 16 '20 13:09 nbrosse

Hi, I'm having the same problem, and use_memory_fs=False doesn't solve it.

aikanarov avatar Oct 07 '20 14:10 aikanarov

I had a similar exception when running out of memory.

FRiMN avatar Feb 17 '21 17:02 FRiMN

Same situation with pandarallel 1.5.2 and Python 3.6.8, though with JupyterLab rather than PyCharm. JupyterLab is running in a Docker container, but I explicitly increased the /dev/shm size. The use_memory_fs parameter does not affect the occurrence of the error.

The stacktrace matches the one in https://github.com/nalepae/pandarallel/issues/76#issue-564741232.

I tried various older versions of pandarallel with no success.

alexanderwiller avatar Mar 16 '21 18:03 alexanderwiller

Same here: works fine in the terminal, fails in PyCharm.

erichlin avatar May 26 '21 15:05 erichlin

I have the same issue with PyCharm Professional 2021.2.2; using use_memory_fs=False does not solve the problem.

imahdimir avatar Sep 22 '21 05:09 imahdimir

Deactivating "Run with Python Console" in the run configuration solved the problem for me.

schillingalex avatar Dec 10 '21 08:12 schillingalex

Same issue when running it as a Python file.

JKHenry520 avatar Apr 25 '22 09:04 JKHenry520

If this problem occurs, the first step is to run the same code with the plain apply function and see whether it works. If your code fails for any reason, pandarallel surfaces the failure as an EOFError.
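That debugging step can be sketched like this (a minimal, self-contained example with toy data, not the original reproduction):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "x": [1.0, 2.0, 3.0, 5.0]})

# Run the plain, single-process apply first: any exception raised
# here comes from your own function, with a readable traceback,
# rather than surfacing as an opaque EOFError from a worker process.
result = df.groupby("group")["x"].apply(lambda s: s.mean())
print(result)
```

Once this runs cleanly, switching back to parallel_apply isolates whether the remaining failure is in pandarallel's worker machinery rather than in your function.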

mufassir-khan avatar Jun 13 '22 09:06 mufassir-khan

I have the same issue; it is still not solved...

Pyramiding avatar Jun 21 '22 13:06 Pyramiding

Reporting the same issue. My code works fine when using apply().

heya5 avatar Jun 29 '22 16:06 heya5

For me, the same error ("Ran out of input") occurs in VSCode when run as a Python file. It works fine in a Jupyter Notebook.

AlexTo avatar Jul 12 '22 05:07 AlexTo

Same here; neither the memory fs setting nor the Python Console trick works. However, I receive a slightly different error:

  File "/home/thomas/.local/lib/python3.8/site-packages/pandarallel/core.py", line 195
    get_dataframe_and_delete_file(output_file_path)
  File "/home/thomas/.local/lib/python3.8/site-packages/pandarallel/core.py", line 189, in get_dataframe_and_delete_file
    data = pickle.load(file_descriptor)
EOFError: Ran out of input

so it seems to me that the pickle file is not fully written before it is read
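The "Ran out of input" message is exactly what pickle.load raises on an empty or truncated file, which is consistent with that read-before-write theory. A minimal stdlib illustration:

```python
import pickle
import tempfile

# pickle.load on a file that was never written (or only partially
# flushed) raises EOFError("Ran out of input"), the same message
# as in the traceback above.
with tempfile.TemporaryFile() as f:  # an empty file object
    try:
        pickle.load(f)
        err = None
    except EOFError as e:
        err = e

print(err)  # -> Ran out of input
```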

thomas-icomplai avatar Jul 26 '22 07:07 thomas-icomplai

Same issue in a Linux console when running a .py script.

hanlinGao avatar Nov 07 '22 12:11 hanlinGao

I had the same error. I was using a function decorated with lru_cache and passing it to parallel_apply with axis=1; removing the lru_cache decorator solved the issue.

sadakmed avatar Nov 24 '22 13:11 sadakmed

For me, the error was resolved once I passed a normal named function to parallel_apply instead of a lambda.
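A plausible explanation for this (an assumption, since it depends on how pandarallel serializes the callable for its worker processes) is that lambdas often cannot be pickled, whereas module-level named functions are pickled by reference. A quick stdlib check:

```python
import pickle

def double(x):
    # A module-level named function is pickled by reference
    # (module name + qualified name), so dumps succeeds.
    return 2 * x

named_ok = pickle.dumps(double) is not None

try:
    pickle.dumps(lambda x: 2 * x)
    lambda_err = None
except Exception as e:  # typically pickle.PicklingError
    lambda_err = e
```

If the serializer in use behaves like the standard pickle module, moving the applied logic into a named top-level function avoids this class of failure.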

LukasHaas avatar Dec 01 '22 18:12 LukasHaas