pandarallel
Processes stopped when passing large objects to the function to be parallelized
Problem:
Apply an NLP deep learning model for text generation over the rows of a pandas Series. The function call is:
out = text_column.parallel_apply(lambda x: generate_text(args, model, tokenizer, x))
where args and tokenizer are light objects, but model is a heavy one: a PyTorch model that weighs more than 6 GB on disk and takes up ~12 GB of RAM when running.
I have been doing some tests, and the problem arises only when I pass the heavy model to the function (even without actually running it inside the function), so the issue seems to be passing an argument that takes up a lot of memory. (Maybe related to the shared-memory strategy for parallel computing.)
After running parallel_apply, the output I get is:
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
0.00% | 0 / 552 |
0.00% | 0 / 552 |
0.00% | 0 / 551 |
0.00% | 0 / 551 |
0.00% | 0 / 551 |
0.00% | 0 / 551 |
0.00% | 0 / 551 |
0.00% | 0 / 551 |
And it gets stuck there forever. Indeed, there are two processes spawned, and both are stopped:
ablanco+ 85448 0.0 4.9 17900532 12936684 pts/27 Sl 14:41 0:00 python3 text_generation.py --input_file input.csv --model_type gpt2 --output_file out.csv --no_cuda --n_cpu 8
ablanco+ 85229 21.4 21.6 61774336 57023740 pts/27 Sl 14:39 2:26 python3 text_generation.py --input_file input.csv --model_type gpt2 --output_file out.csv --no_cuda --n_cpu 8
Hello,
- First, could you tell me if this issue also arises with classical pandas? (If not, we are sure it is exclusively a pandarallel issue.)
- Could you also please try without the progress bar and without using the memory filesystem? (pandarallel.initialize(use_memory_fs=False))
I guess it won't work, but maybe it could give me more information about the topic.
Actually, to serialize lambda functions, pandarallel uses dill.
Because dill is very slow compared to classical Python serialization, pandarallel uses dill only to serialize the function to apply; the rest (the dataframe and so on) is serialized with standard Python serialization.
But unfortunately, in your case the function to apply is huge, because it contains model.
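To make this concrete, here is a minimal sketch (not pandarallel internals; payload is just a stand-in for model) showing how a lambda that closes over a large object drags that object into its dill payload:

import dill

def make_fn(payload):
    # payload is captured in the lambda's closure, so dill serializes
    # it together with the function itself
    return lambda x: (x, len(payload))

small_fn = make_fn(b"")                      # empty capture
heavy_fn = make_fn(bytes(50 * 1024 * 1024))  # ~50 MB capture

print(len(dill.dumps(small_fn)))  # a few hundred bytes
print(len(dill.dumps(heavy_fn)))  # roughly 50 MB

With a ~12 GB model in the closure, that serialization (and the matching deserialization in each worker) can take a very long time.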
Could you also tell me how much RAM you have, and the RAM usage during your pandarallel call?
And if you have time, could you try with only 2 workers? (Or even 1 worker. Of course, 1 worker is useless compared to classical pandas, but at least it uses the pandarallel mechanism.)
My guesses are the following:
- Either pandarallel is working, but the serialization of your model takes a long time, so the function to apply has not yet been fully received by the worker processes (the progress bars only really start to move once workers begin treating data; during (de)serialization they stay at 0%).
- Or you have run out of memory. pandarallel is optimized to consume as little RAM as possible for the dataframe, but the function to apply is copied n times in memory if you have n workers. Usually the function itself is very light.
Hi @nalepae, thank you for your detailed and fast answer.
- First, could you tell me if this issue also arises with classical pandas? (If not, we are sure it is exclusively a pandarallel issue.)
Yes, if I replace the parallel_apply function with the standard apply function, everything works correctly (but slowly).
- Could you also please try without the progress bar and without using the memory filesystem? (pandarallel.initialize(use_memory_fs=False))
Thanks for the suggestions. Same behaviour.
Could you also tell me how much RAM you have, and the RAM usage during your pandarallel call?
This is the output of free -m during the pandarallel call; I don't think free RAM is the problem:
total used free shared buff/cache available
Mem: 257672 52909 9537 51 195225 203649
Swap: 4095 590 3505
And if you have time, could you try with only 2 workers? (Or even 1 worker. Of course, 1 worker is useless compared to classical pandas, but at least it uses the pandarallel mechanism.)
I have just tried setting nb_workers=1 and nothing changes:
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
0.00% | 0 / 6 |
Please, tell me whatever you need and thanks again.
Also ran into this issue; it took forever to debug, as the offending argument was actually part of self...
I still have lots of RAM, so the serialization guess seems to be spot on, considering that on KeyboardInterrupt the traceback mostly goes into dill and pickle.
Here is reproducible code:
import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(nb_workers=1, use_memory_fs=False)


class A:
    def __init__(self, var1):
        self.var1 = var1

    def f(self, *args):
        pass

    def run(self):
        df = pd.DataFrame(dict(a=np.random.rand(100)))
        df.apply(lambda x: self.f(x), axis=1)
        print("apply is ok")
        df.parallel_apply(lambda x: self.f(x), axis=1)  # hangs if self.var1 is too big
        print("parallel is ok")


if __name__ == "__main__":
    a_list = [1] * 1024 * 1024 * 1024
    a = A(a_list)
    a.run()
Produces:
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
apply is ok
And hangs...
Appreciate your work! @nalepae
Currently fixed by upgrading Python from 3.7.4 to 3.7.6; apparently the problem was with pickle.
For those wondering why a single process runs indefinitely with no results: I was on 3.6.4, and upgrading to 3.7.6 fixed the issue. Still no luck with progress bars, sadly.
I got around this by setting the function parameters to global variables.
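For the record, a hedged sketch of that global-variable workaround, assuming a fork-based multiprocessing start method (the Linux default): a module-level global assigned before the workers are forked is inherited by them copy-on-write instead of travelling through the lambda's closure, so only a tiny function has to be serialized. BIG and apply_fn are illustrative names, and the NumPy array is just a stand-in for a heavy model.

import numpy as np
import pandas as pd
from pandarallel import pandarallel

BIG = None  # module-level slot for the heavy object

def apply_fn(x):
    # references the global instead of capturing it, so the serialized
    # function stays tiny; forked workers inherit BIG from the parent
    return x + BIG[0]

if __name__ == "__main__":
    BIG = np.ones(50 * 1024 * 1024)  # ~400 MB stand-in for the model
    pandarallel.initialize(nb_workers=2)
    s = pd.Series(np.random.rand(100))
    print(s.parallel_apply(apply_fn).head())

Under a spawn start method (e.g. on Windows) the global would not be inherited by the workers, so this trick does not carry over as-is.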