
Support for multiple data sources

Open vfdev-5 opened this issue 4 years ago • 12 comments

The idea is to enhance Engine so that it can run on multiple data sources:

data1 = ...
data2 = ...
data3 = ...
...

def process_function(engine, batches):
    batch1 = batches['data1']
    batch2 = batches['data2']
    batch3 = batches['data3']
    ...

engine = Engine(process_function)
engine.run({"data1": data1, "data2": data2, "data3": data3}, max_epochs=10)
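Without changing Engine, the proposal could be emulated today by zipping the sources into a single iterable of batch dicts. A minimal sketch, assuming one source is the "primary" that defines the epoch length and the others are cycled (`combine_sources` is a hypothetical helper; plain lists stand in for real DataLoaders):

```python
from itertools import cycle

def combine_sources(sources, primary):
    """Yield a dict with one batch per source for every batch of the
    primary source; the non-primary sources are cycled endlessly."""
    iters = {name: (iter(src) if name == primary else cycle(src))
             for name, src in sources.items()}
    for primary_batch in iters[primary]:
        batches = {primary: primary_batch}
        for name in iters:
            if name != primary:
                batches[name] = next(iters[name])
        yield batches

# Plain lists stand in for DataLoaders here.
data1, data2 = [1, 2, 3], ["a", "b"]
combined = list(combine_sources({"data1": data1, "data2": data2},
                                primary="data1"))
# Each element is a dict: {"data1": ..., "data2": ...}
```

The resulting iterable could then be passed to a standard Engine whose process function unpacks the dict, much like the `process_function(engine, batches)` sketch above.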

vfdev-5 avatar Jan 13 '20 23:01 vfdev-5

Question: why do we need this? Can't we wrap all sources in a single dataset that returns a dict?

justusschock avatar Jan 24 '20 05:01 justusschock

@justusschock yes, we can wrap them, but setting up such a dataset of datasets is more of a headache. For example, say we have 2 datasets of different sizes: supervised_dataset (1k samples) and unsupervised_dataset (50k samples). It would be rather natural, IMO, to run the engine like

sup_dataloader = ...
unsup_dataloader = ...

trainer.run({"supervised": sup_dataloader, "unsupervised": unsup_dataloader}, max_epochs=10, epoch_length=len(sup_dataloader))
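One plausible reading of this call, sketched with plain Python iterators (hypothetical semantics, with short lists standing in for the two dataloaders): each epoch takes `epoch_length` steps driven by the supervised loader, while the larger unsupervised loader keeps its position across epochs.

```python
from itertools import islice

sup_loader = ["s0", "s1", "s2"]              # ~1k batches in the real case
unsup_loader = [f"u{i}" for i in range(10)]  # ~50k batches in the real case

epoch_length = len(sup_loader)
unsup_iter = iter(unsup_loader)  # persists across epochs

epochs = []
for _ in range(2):  # max_epochs=2 for the sketch
    epoch = []
    for sup_batch, unsup_batch in zip(sup_loader,
                                      islice(unsup_iter, epoch_length)):
        epoch.append({"supervised": sup_batch, "unsupervised": unsup_batch})
    epochs.append(epoch)
# epoch 1 consumes u0..u2; epoch 2 continues with u3..u5
```

Whether the unsupervised iterator should persist or restart each epoch is exactly the kind of semantic question this feature would need to settle.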

Another use case could be image-to-image GANs.

What do you think ?

vfdev-5 avatar Jan 24 '20 09:01 vfdev-5

I agree that this might be good to have, but I would not add this feature to the general engine; maybe derive a subclass for it instead. We can reintegrate it later if it turns out not to be too much effort, but I think it would complicate many things.

EDIT: Actually, I started working in that direction by first enabling support for arbitrary data types such as mappings. Maybe we can postpone this until that part is done, because afterwards we could better estimate the amount of work.

justusschock avatar Jan 24 '20 09:01 justusschock

I see, let's wait a bit on this one. Until 0.4.0 we have a lot of room for intermediate releases.

vfdev-5 avatar Jan 24 '20 09:01 vfdev-5

How should the resulting batches look there?

If you provide something like trainer.run({"supervised": sup_dataloader, "unsupervised": unsup_dataloader}, max_epochs=10, epoch_length=len(sup_dataloader)), will you get something like

{ 'supervised': #whatever format the supervised batch has,
'unsupervised': #whatever format the unsupervised batch has}

?

There are a few other questions I have regarding this:

  • Should there be separate indices/separate samplers for both of them (I think so), or shall they somehow share their sampler?
  • Same for things like prepare_batch and metric evaluation.

I think that in my personal code I would always rewrap things into a single dataset, since this way you have more control over what is happening. To be honest, it is not that much of an effort/headache:

import random

import torch


class DummyDsetWrapper(torch.utils.data.Dataset):
    def __init__(self, sup_dset, unsup_dset):
        super().__init__()
        self.sup_dset = sup_dset
        self.unsup_dset = unsup_dset

    def __getitem__(self, index):
        # Index the supervised set directly; pair it with a random
        # unsupervised sample (randrange avoids an off-by-one at the end).
        return {
            "supervised": self.sup_dset[index],
            "unsupervised": self.unsup_dset[random.randrange(len(self.unsup_dset))],
        }

    def __len__(self):
        return len(self.sup_dset)

That way you have much more control over your sampling (which may even be dependent between the two datasets) and all the other stuff. We can provide such a template class, but wrapping this in an engine is (in my opinion) not suitable, since it has to be customised a lot.
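The sampling behaviour of such a wrapper can be illustrated without torch. A sketch with plain lists standing in for the two datasets and a seeded `random` for reproducibility: the epoch length follows the supervised set, and each item pairs a supervised sample with a randomly drawn unsupervised one.

```python
import random

sup_dset = list(range(4))           # small labelled set
unsup_dset = list(range(100, 200))  # large unlabelled set

random.seed(0)
epoch = [
    {"supervised": sup_dset[i],
     "unsupervised": unsup_dset[random.randrange(len(unsup_dset))]}
    for i in range(len(sup_dset))   # epoch length follows the supervised set
]
```

With this layout every supervised sample is seen exactly once per epoch, while unsupervised samples are drawn with replacement, which matches the wrapper's `__len__`/`__getitem__` semantics above.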

justusschock avatar Jan 28 '20 09:01 justusschock

I randomly saw this issue while checking in on the 0.4 release. I also feel that this should be left up to the datasets. It's not immediately clear what the semantics are if the iterators are not the same length.

Perhaps this is not the right venue for this, but I just want to offer my unsolicited $.02 on features such as this one.

IMHO the strength of ignite over other packages such as PyTorch Lightning is that the engine and event system is extremely simple. It allows one to inject logic anywhere in the loop as one pleases, and it's very obvious what will happen. The reason why I don't like PyTorch Lightning as much is exactly because it's filled with features like this one here. It's got a huge API, and has a lot of logic along the lines of "if you want to do this, then use this API and use a None for that argument but pass a dict for this other argument, etc. etc.". What I love about ignite is that basically everything except the engine, even very common things like model checkpointing, is a "plugin" to be registered. If I want the model checkpointing to write 10 copies of the same file, I am fully empowered to write a handler to do so. I'm not sure how that would be done in other frameworks. (Obviously a contrived example to try to make a point, and to avoid the argument of "actually, in Lightning you can do xyz".)

I would personally implore you to ruthlessly keep the Engine from becoming complicated. For instance, the determinism changes added to the engine caused some roadblocks for me (#935, #941).

I absolutely understand the desire to cover more use cases for users, and you have done an amazing job so far. However, when confronted with a design decision exactly like this one, IMHO a solution that can be implemented without additions to the engine should be chosen (e.g. by providing the Dataset and DataLoader implementations under an ignite.data or ignite.contrib). A similar argument can be made about the determinism stuff: subclasses of Engine and optional datasets should be preferred.

Thanks for your continued work on an amazing library :)

amatsukawa avatar Jun 17 '20 22:06 amatsukawa

@amatsukawa thanks for your feedback, I really appreciate it :)

I understand your point about our tendency to over-complicate the API, and thanks a lot for saying that! I would like to have your opinion on the Engine of the 0.4.0 RC, where we removed all that deterministic patching (and put it into DeterministicEngine, for users who would like the 0.3.0 Engine behaviour). Does it solve the issues you encountered with the previous 0.3.0 version?

As for this issue: definitely, we should rethink it and maybe cover it without complicating Engine itself, using additional helper structures instead.

And please do not hesitate to comment on other issues if they seem to have a similar impact to this one. It would help us be more "connected" with our users :)

vfdev-5 avatar Jun 17 '20 22:06 vfdev-5

@vfdev-5 yes, by the looks of it, it does seem to solve our problems. That's why I was checking up on the official 0.4 release when I came across this in one of the kanban boards :)

I have yet to actually try it, but if I can find some time next week, I will install the rc and give it a go.

amatsukawa avatar Jun 17 '20 23:06 amatsukawa

@vfdev-5 sorry to keep hijacking this thread, but I've already done it so I might as well do it one last time :) Re: trying the rc, I'm wondering what your ETA is for releasing 0.4? I'm still on 0.2.1 (didn't want to install a nightly in prod) so I'd like to give it a try to make sure we can upgrade to 0.4, just trying to see how much time I have to do so. Thanks!

amatsukawa avatar Jun 18 '20 21:06 amatsukawa

sorry to keep hijacking this thread

@amatsukawa no problems at all :)

We are planning to release it before June 30th: fixing "bugs" from https://github.com/pytorch/ignite/projects/7 and updating the README, docs, etc.

vfdev-5 avatar Jun 18 '20 21:06 vfdev-5

The idea is to enhance Engine so that it can run on multiple data sources:

data1 = ...
data2 = ...
data3 = ...
...

def process_function(engine, batches):
    batch1 = batches['data1']
    batch2 = batches['data2']
    batch3 = batches['data3']
    ...

engine = Engine(process_function)
engine.run({"data1": data1, "data2": data2, "data3": data3}, max_epochs=10)

Great idea! ... Because of this I usually have to pass the data in again and again and can't use create_supervised_trainer().

bibhabasumohapatra avatar Jan 03 '22 03:01 bibhabasumohapatra

IMHO the strength of ignite over other packages such as PyTorch Lightning is that the engine and event system is extremely simple. It allows one to inject logic anywhere in the loop as one pleases, and it's very obvious what will happen. The reason why I don't like PyTorch Lightning as much is exactly because it's filled with features like this one here. It's got a huge API, and has a lot of logic along the lines of "if you want to do this, then use this API and use a None for that argument but pass a dict for this other argument, etc. etc.". What I love about ignite is that basically everything except the engine, even very common things like model checkpointing, is a "plugin" to be registered. If I want the model checkpointing to write 10 copies of the same file, I am fully empowered to write a handler to do so. I'm not sure how that would be done in other frameworks. (Obviously a contrived example to try to make a point, and to avoid the argument of "actually, in Lightning you can do xyz".)

But this sounds very right too!

bibhabasumohapatra avatar Jan 03 '22 03:01 bibhabasumohapatra