
Memory overhead in multiprocessing

Open jordane95 opened this issue 1 year ago • 8 comments

When using the fasttext filter, I find that the fasttext model is copied into each process, which introduces significant memory overhead. To my knowledge, however, the fasttext model is read-only and could be stored in a shared memory space accessible to all processes.

Can we optimize the current code to save memory? mp.Manager can create shared memory and avoid copying, but I find it quite hard to integrate into the current code, since the manager is initialized at the executor level and not passed to each pipeline step.

jordane95 avatar Apr 24 '24 08:04 jordane95
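One alternative to a manager, as a sketch rather than datatrove's actual mechanism: on platforms that support the "fork" start method (e.g. Linux), loading the model once in the parent before creating workers lets children read it through copy-on-write pages, so read-only access does not duplicate the memory. CPython reference counting can still dirty some pages, so the savings are partial. MODEL below is a hypothetical stand-in for the fasttext model.

```python
import multiprocessing as mp

# Stand-in for a large read-only model, loaded once in the parent process.
MODEL = {"weights": list(range(1_000))}

def score(i):
    # With the "fork" start method, workers inherit MODEL via
    # copy-on-write pages: reading it does not duplicate the memory.
    return MODEL["weights"][i]

def run():
    ctx = mp.get_context("fork")  # fork is Unix-only
    with ctx.Pool(2) as pool:
        return pool.map(score, [1, 2, 3])

if __name__ == "__main__":
    print(run())  # [1, 2, 3]
```

This only helps for objects fully materialized before the fork; anything loaded lazily inside a worker is still per-process.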

Indeed, there might be some complications. I would be curious, however, to know the performance (speed) implications of loading the model from shared memory. Have you tested this?

guipenedo avatar Apr 24 '24 14:04 guipenedo
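For measuring this, the stdlib multiprocessing.shared_memory module is one way to experiment: the payload is copied once into a named block, and other processes attach by name and read it in place. A minimal sketch, where the payload is a hypothetical stand-in for serialized model bytes:

```python
from multiprocessing import shared_memory

# Copy a payload once into a named shared-memory block; any other process
# can then attach by name and read it without duplicating the data.
payload = b"fasttext-model-bytes"  # stand-in for large model weights
owner = shared_memory.SharedMemory(create=True, size=len(payload))
owner.buf[:len(payload)] = payload

# A worker process would attach by name instead of receiving a copy:
reader = shared_memory.SharedMemory(name=owner.name)
seen = bytes(reader.buf[:len(payload)])  # read through shared memory

reader.close()
owner.close()
owner.unlink()  # free the block once all readers are done
```

The catch for fasttext specifically is that its Python API loads models from a file path, so feeding it raw shared bytes would need extra plumbing (e.g. a memfd or tmpfs path); that part is not shown here.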

I have a question regarding memory overhead. I created and ran an executor designed to count tokens on approximately 2 TB of text (JSONL), but it gets stuck every time I run it. According to the memory and CPU usage data, memory usage fills up the 256 GB I have available, and after getting stuck, CPU usage drops from 99% to 0%.

The problem is that there are no error messages in the log, making it impossible to resolve the issue. Does anyone have any suggestions on how to address this? I suspect this might be a memory overhead issue.

justHungryMan avatar May 24 '24 09:05 justHungryMan
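Since the run above hangs with no error in the log, one low-tech diagnostic is to log each worker's peak RSS between pipeline steps, to see which stage's memory grows before the stall. A sketch using only the stdlib (the resource module is Unix-only, and note the ru_maxrss units differ by platform):

```python
import resource
import sys

def peak_rss_mib():
    # Peak resident set size of this process so far.
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return rss / divisor

# Call between pipeline steps (or periodically) and write to the log:
print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Watching this value per step can at least narrow a silent OOM down to one stage (e.g. tokenization vs. extraction) before the kernel OOM killer intervenes without leaving a Python traceback.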

Hi, is your problem solved now? I encountered similar issues (unexpected OOMs resulting in failed jobs). I suspect the source of the unexpected OOMs may be a few documents with very long content.

SinclairCoder avatar Aug 26 '24 12:08 SinclairCoder

Same issue here. I applied a 20k-word limit per document beforehand, which solved most of it, but I still get a few OOMs (possibly also due to the size of specific input files?). It would be nice to circumvent the issue, as it fails silently…

Pclanglais avatar Aug 30 '24 19:08 Pclanglais

So I found a fix: lowering the tokenizer batch size. 1000 runs fine (or go even lower to keep all available CPUs busy on long texts).

Actually, in tokenizer.py, 1000 was meant to be the default value. The docstring says: batch_size (int): batch size for tokenization (default: 1000)

While the code has: batch_size: int = 10000,  # batch size for tokenization

Maybe bringing back 1000 would be safer?

Pclanglais avatar Aug 31 '24 00:08 Pclanglais
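For intuition on why lowering batch_size helps: with batched tokenization, peak memory scales roughly with batch_size times the average tokenized document size, so a handful of very long documents in a 10000-document batch can blow past available RAM, while a 1000-document batch bounds the damage. A generic sketch of the batching pattern (not datatrove's actual code):

```python
from itertools import islice

def batched(docs, batch_size):
    # Yield lists of at most batch_size documents, so only one batch's
    # worth of tokenized output needs to be resident at a time.
    it = iter(docs)
    while batch := list(islice(it, batch_size)):
        yield batch

# 25 documents with batch_size=10 -> batches of 10, 10, 5
sizes = [len(b) for b in batched(range(25), 10)]
```

Smaller batches trade a little throughput (more per-batch overhead) for a much tighter bound on peak memory, which is usually the right trade on heterogeneous web text.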

My OOM case happened in the text extractor. But I do not know how to fix it. Sad.

SinclairCoder avatar Aug 31 '24 11:08 SinclairCoder

Reducing workers or batch_size temporarily fixes the memory overflows, but the real issue is that the module cannot detect or report these problems itself. Enhancements are needed for stable, efficient operation.

justHungryMan avatar Aug 31 '24 11:08 justHungryMan

Got it. Typically, I start n tasks to process data (a pipeline might consist of the WARC reader, URL filter, text extractor, and writer). However, many tasks (sometimes half of them or more) fail due to OOM. I have to rerun the script to resume these tasks, which takes more time and more memory, and since I don't know the exact memory requirement, they could fail again. I'm struggling with it.

Any suggestions are welcome!

cc @guipenedo

SinclairCoder avatar Aug 31 '24 12:08 SinclairCoder