datatrove

How to postpone filter init until the job is running

Open stas00 opened this issue 7 months ago • 5 comments

It appears that I currently can't instantiate a model on a GPU, because the filter object is created by the launcher. The launcher either has no GPU or, if it has one, it is most likely the wrong one, since each task needs its own dedicated GPU(s).

Is it possible to add a second init, a user-defined one that runs inside the actual job?

The filter task is simple: instantiate a model on a GPU, then run the filter using it. Of course, we don't want the model to be re-instantiated on every filter call.

Needing to import torch inside the filter is super-weird as well, but I get that it's due to pickle. Perhaps we could have two inits: one for the framework and another for the user.

So when a job is launched, the first thing the framework runs is the user-defined init, if there is one, and then it proceeds normally.
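A minimal sketch of the two-init idea, in plain Python. Everything here is hypothetical: `TwoPhaseFilter`, `user_init`, and `run` are illustration names, not datatrove's actual API, and the "model" is a trivial stand-in for something that would really import torch and load weights onto a GPU.

```python
class TwoPhaseFilter:
    """Hypothetical base class, NOT datatrove's real API."""

    def __init__(self, model_name: str):
        # Phase 1 (launcher side): store only cheap, picklable config.
        self.model_name = model_name
        self._ready = False

    def user_init(self) -> None:
        """Phase 2 hook: override to load heavy resources on the worker."""

    def filter(self, doc: str) -> bool:
        raise NotImplementedError

    def run(self, docs):
        # The framework would call user_init once, before the first document.
        if not self._ready:
            self.user_init()
            self._ready = True
        return [d for d in docs if self.filter(d)]


class LengthFilter(TwoPhaseFilter):
    init_calls = 0  # counts how often the expensive init actually runs

    def user_init(self) -> None:
        # Stand-in for importing torch and moving a model to this job's GPU.
        LengthFilter.init_calls += 1
        self.model = lambda text: len(text) > 3

    def filter(self, doc: str) -> bool:
        return self.model(doc)


# Launcher side: the object is created, but no model exists yet.
flt = LengthFilter("dummy-model")
model_loaded_at_launch = hasattr(flt, "model")

# Worker side: the first run triggers user_init exactly once.
kept = flt.run(["hi", "hello", "data"])
flt.run(["more", "docs"])
```

Until `user_init` runs, the object holds only a string and a bool, so it stays cheap to pickle and ship to the job.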

I guess I will try to work around this in the meantime using @functools.cache or something similar.
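A rough sketch of that workaround, assuming Python 3.9+ (where `functools.cache` exists). The names `get_model` and `CachedModelFilter` are hypothetical, and the dictionary returned by the loader is only a placeholder for a real model; in practice the loader body is where the torch import and GPU placement would go.

```python
import functools

load_count = 0  # illustration only: tracks how often the loader body runs


@functools.cache
def get_model(model_name: str):
    # Runs once per process per distinct model_name; later calls hit the cache.
    # A real version would import torch here and load the model on the GPU.
    global load_count
    load_count += 1
    return {"name": model_name, "min_len": 3}


class CachedModelFilter:
    """Hypothetical filter: holds only config, so it pickles without the model."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def filter(self, doc: str) -> bool:
        model = get_model(self.model_name)  # loaded lazily on the first call
        return len(doc) > model["min_len"]


flt = CachedModelFilter("dummy-model")
results = [flt.filter(d) for d in ["hi", "hello", "data"]]
```

The cache lives in the process that calls `get_model`, so each worker process loads its own copy the first time its filter runs, and nothing is loaded on the launcher.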

Thank you!

tag: @guipenedo

stas00 • Jul 09 '24 01:07