datatrove
How to postpone filter init until the job is actually running
It appears that I currently can't instantiate a model on a GPU, because the filter object is created by the launcher. The launcher either has no GPU at all or, even if it has one, it is most likely the wrong one, since each task would need its own dedicated GPU(s).
Is it possible to add a second init, a user-defined one that runs on the actual job?
The filter task is simple: instantiate a model on a GPU, then run the filter using it. Of course, we don't want the model to be re-instantiated on every filter call.
Needing to `import torch` inside the filter is super-weird as well, but I get that it's due to pickle. Perhaps we can have two inits: one for the framework, and then another for the user.
So when a job is launched, the first thing the framework runs is the user-defined init, if any, and then it proceeds normally.
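To make the request concrete, here is a minimal sketch of what such a two-phase init could look like. This is not datatrove's API: `TwoPhaseStep`, `setup()`, and `LengthFilter` are hypothetical names made up for illustration, and `len` stands in for a real model.

```python
class TwoPhaseStep:
    """Hypothetical base class: __init__ runs on the launcher, setup() on the worker."""

    def __init__(self):
        self._ready = False

    def setup(self) -> None:
        """User-defined init; override it to load models or grab a GPU."""

    def filter(self, item) -> bool:
        raise NotImplementedError

    def run(self, items):
        # The framework would call setup() once, on the job, before any item is processed.
        if not self._ready:
            self.setup()
            self._ready = True
        for item in items:
            if self.filter(item):
                yield item


class LengthFilter(TwoPhaseStep):
    def __init__(self, min_len: int):
        super().__init__()
        self.min_len = min_len  # cheap, picklable config only
        self.setup_calls = 0
        self.model = None       # nothing heavy is created on the launcher

    def setup(self) -> None:
        self.setup_calls += 1
        self.model = len        # stand-in for loading a real model onto a GPU

    def filter(self, item) -> bool:
        return self.model(item) >= self.min_len
```

With this shape, the object that gets pickled and shipped to the job only carries configuration, and the expensive state is built exactly once per worker.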
I guess I will try to work around this in the meantime using `@functools.cache` or something similar.
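A rough sketch of that workaround, assuming the filter is just an object with a `filter` method that gets pickled by the launcher. `FakeModel`, `get_model`, and `LazyModelFilter` are hypothetical names, and the fake model stands in for a real one that would be loaded with torch.

```python
import functools


class FakeModel:
    """Stand-in for a real model; counts how many times it is constructed."""

    instances = 0

    def __init__(self, device: str):
        FakeModel.instances += 1
        self.device = device

    def score(self, text: str) -> float:
        # Toy score: longer text scores higher, capped at 1.0.
        return min(len(text) / 100.0, 1.0)


@functools.cache
def get_model(device: str) -> FakeModel:
    # Heavy imports (e.g. `import torch`) would go here, so they only run on the job.
    # The cache ensures the model is built once per process and device.
    return FakeModel(device)


class LazyModelFilter:
    def __init__(self, device: str = "cuda:0", threshold: float = 0.5):
        # Only plain config is stored, so the object pickles cleanly on the launcher.
        self.device = device
        self.threshold = threshold

    def filter(self, text: str) -> bool:
        model = get_model(self.device)  # first call builds the model, later calls reuse it
        return model.score(text) >= self.threshold
```

The model is created on the first `filter` call rather than in `__init__`, so it lands on whatever device the worker actually has, and subsequent calls hit the cache.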
Thank you!
tag: @guipenedo