datatrove

Memory overhead in multiprocessing

Open jordane95 opened this issue 10 months ago • 8 comments

When using the fasttext filter, I find that each process loads its own copy of the fasttext model, which adds significant memory overhead. However, to my knowledge, a fasttext model is read-only, so it could be stored in a memory space shared across all processes.

Can we optimize the current code to save memory? I find that using mp.Manager makes it possible to share objects across processes and avoid copying memory. However, it is quite hard to integrate into the current code, because the manager is initialized at the executor level and is not passed to the individual pipeline steps.
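To make the idea concrete, here is a minimal sketch of sharing read-only data across worker processes. It uses the standard library's `multiprocessing.shared_memory` instead of a Manager proxy, since a Manager keeps the object in a server process and serves it through proxies. The list of weights is only a hypothetical stand-in for the fasttext model, and `publish_weights`, `score`, and `run_demo` are illustrative names, not datatrove or fasttext APIs.

```python
import struct
from multiprocessing import Pool, shared_memory


def publish_weights(weights):
    """Copy a list of floats into a new shared-memory block and return it."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(weights))
    struct.pack_into(f"{len(weights)}d", shm.buf, 0, *weights)
    return shm


def score(task):
    """Attach to the block by name and compute a dot product with `features`."""
    name, features = task
    shm = shared_memory.SharedMemory(name=name)  # attach, no new allocation
    weights = shm.buf.cast("d")  # view the bytes as doubles without copying
    try:
        return sum(weights[i] * x for i, x in enumerate(features))
    finally:
        weights.release()  # the view must be released before closing
        shm.close()


def run_demo(n_workers=2):
    """Score two feature vectors in a pool that shares one weights block."""
    shm = publish_weights([0.5, -1.0, 2.0])
    try:
        tasks = [(shm.name, [1.0, 0.0, 0.0]), (shm.name, [0.0, 1.0, 1.0])]
        with Pool(n_workers) as pool:
            return pool.map(score, tasks)
    finally:
        shm.close()
        shm.unlink()  # the creator is responsible for freeing the block
```

Call `run_demo()` from under an `if __name__ == "__main__":` guard so it also works with the spawn start method. Only the block name is sent to the workers, so it does not depend on the executor passing a manager object down to each pipeline step. Whether a real fasttext model can be laid out in such a block is a separate question that this sketch does not answer.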

jordane95 · Apr 24 '24 08:04