Could `join` replace `gather_for_metrics` and perform it automatically?

Open · Chris-hughes10 opened this issue 1 year ago · 14 comments

Hi,

Great job with accelerate!

One persistent headache that I keep experiencing is repeated samples during distributed evaluation. Although there is the gather_for_metrics functionality, it doesn't always work, depending on the output of the model; for example, Faster-RCNN in Torchvision outputs a list of dictionaries containing tensors.
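
For context, the output in that case looks roughly like the sketch below (shapes are illustrative); the number of detections differs per image and per process, which is what makes a plain gather awkward:

# Roughly what torchvision's Faster-RCNN returns in eval mode: one dict per image,
# with tensors whose first dimension (the number of detections) varies.
import torch

outputs = [
    {"boxes": torch.rand(3, 4), "labels": torch.randint(0, 91, (3,)), "scores": torch.rand(3)},
    {"boxes": torch.rand(7, 4), "labels": torch.randint(0, 91, (7,)), "scores": torch.rand(7)},
]
# gather_for_metrics wants tensors it can concatenate consistently across processes,
# so this kind of ragged, nested structure is where I run into trouble.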

I was wondering if it would be possible to remove the repeated-samples behaviour completely, using something like the join context manager, which enables training with uneven inputs across processes.
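
In raw PyTorch, that context manager is used roughly like this (a minimal sketch with DDP, just to illustrate the mechanism; it is not accelerate code):

# Minimal sketch of torch.distributed.algorithms.join.Join with DDP.
# Each rank may run a different number of iterations; Join shadows the collective
# communications of ranks that finish early so the remaining ranks don't hang.
import torch
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(model: DDP, optimizer: torch.optim.Optimizer, dataloader) -> None:
    with Join([model]):  # DDP is Joinable; a ZeroRedundancyOptimizer could be passed too
        for batch in dataloader:  # the dataloader may yield a different number of batches per rank
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()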

If there is appetite for this, I would be happy to help you explore options.

Thanks

Chris-hughes10 avatar Sep 07 '22 07:09 Chris-hughes10

Hey @Chris-hughes10, sorry for the delayed response on this; we were all on vacation and then getting back from vacation. Looking at this now, and I do believe it's something we'd want to do!

Were you hoping for something more than just a wrapper around the context manager?

E.g. this API would likely look something like:

with accelerator.uneven_outputs(model, optim, **kwargs):
   loss = model(input).sum()
   ...

muellerzr avatar Sep 29 '22 14:09 muellerzr

Hey @muellerzr, no worries at all, glad that you are interested.

For me, it would be better if you didn't have to use the context manager at all and it was all taken care of behind the scenes by the accelerator, but I understand that this is not as explicit, and I haven't looked into the details of how this would work.

I think that this would also require changes to BatchSamplerShard, so that it supports uneven batch sizes, as well as making sure that any wrappers added to the model or the optimizer play nicely with Joinable.

Chris-hughes10 avatar Sep 29 '22 14:09 Chris-hughes10

I also understand the benefit of that, however it would indeed require a large amount of rewriting to get there. For now I think we'll move forward with just the small wrapper shown above, and I'll rename this GitHub issue to reflect what would be done/proposed.

If you would like this feature (the automatic part of the above conversation) please react with a 👍 to this comment :)

muellerzr avatar Sep 29 '22 14:09 muellerzr

I agree that the small wrapper would be easier to implement to begin with, and I would be happy with this as a starting point, but I also think that the BatchSamplerShard would have to be modified at the same time. Without modifying the BatchSamplerShard, I don't think you will ever be in a position to have uneven batch sizes - as the additional samples will either be dropped or padded - so the wrapper wouldn't add any value.
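
To make the point concrete, here is a toy illustration (not accelerate's actual sharding code) of how keeping batch sizes equal forces the sampler to cycle back and repeat samples:

import math

# Toy model of "even batches": every process must see the same number of full
# batches, so the index list is padded by cycling back to the start of the dataset.
def shard_with_cycling(num_samples: int, num_processes: int, batch_size: int) -> list:
    slots_per_process = math.ceil(num_samples / (num_processes * batch_size)) * batch_size
    padded = [i % num_samples for i in range(slots_per_process * num_processes)]
    return [padded[p * slots_per_process:(p + 1) * slots_per_process] for p in range(num_processes)]

print(shard_with_cycling(num_samples=10, num_processes=4, batch_size=2))
# 16 slots for 10 samples: indices 0-5 appear twice, and those duplicates are
# exactly what gather_for_metrics has to trim back out after evaluation.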

Am I correct here or am I missing something?

Chris-hughes10 avatar Sep 29 '22 15:09 Chris-hughes10

@Chris-hughes10 correct, it would not support dispatched batches, only sharded/iterable datasets.

muellerzr avatar Sep 29 '22 15:09 muellerzr

It wouldn't support sharded datasets either, right? As they will still pad or drop even if they aren't dispatched?

Chris-hughes10 avatar Sep 29 '22 15:09 Chris-hughes10

You're right, it would need significant work to do any of this, apologies! For now, yes, the wrapper would only be useful for those who aren't using accelerate to prepare the dataloaders themselves (such as fastai's integration, which still uses its own dataloader implementation), so it would require custom dataloaders to work for now, considering this upends the basis of accelerate's dataloader framework :)

muellerzr avatar Sep 29 '22 15:09 muellerzr

I am aware that there are a lot of moving parts here for what seems like a small change xD. But, in my opinion, it would be very confusing to add this if it doesn't play well with accelerate's prepare, and it wouldn't fit my personal use cases at all; I would imagine that most people who use raw accelerate would agree.

Looking at the BatchSamplerShard and IterableDatasetShard classes in isolation, it doesn't seem like it would be too bad to modify them to return different batch sizes across processes, but I am not sure whether there would be other consequences elsewhere in the training loop. The simplest approach I can think of would be to drop the padding and just return the extra samples on process 0, e.g. the sketch below.
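
As a rough illustration of what I mean (purely hypothetical, not a proposed implementation):

# Hypothetical "no padding" split: every process gets the same base share and
# any remainder goes to process 0, so lengths differ but nothing is repeated.
def shard_without_padding(num_samples: int, num_processes: int) -> list:
    base, remainder = divmod(num_samples, num_processes)
    shards, start = [], 0
    for process_index in range(num_processes):
        size = base + (remainder if process_index == 0 else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards

print(shard_without_padding(num_samples=10, num_processes=4))
# [[0, 1, 2, 3], [4, 5], [6, 7], [8, 9]] -> uneven lengths, no duplicates,
# which is exactly the situation a join wrapper would then need to handle.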

Are there plans to significantly change the dataloading approach in the near future?

Chris-hughes10 avatar Sep 29 '22 15:09 Chris-hughes10

Accelerate was built when there was no join context manager, and the fact that each dataloader returns the same number of samples in all processes is pretty ingrained in the library. We could make this evolve in the future with a major release, but that would mean a lot of changes (and breaking functionality).

For now I'd rather investigate why your use case is not supported as is (all methods should support a list of dictionaries) than rewrite the whole library.
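
In principle something like this should already work when the tensors have the same shape on every process (a sketch only; ragged outputs like detection boxes would need to be padded across processes first):

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Sketch: gathering a nested structure (a list of dictionaries of tensors) for metrics.
# Assumes every tensor has the same shape on every process; variable-sized outputs
# would need padding across processes before they can be gathered like this.
predictions = [{"logits": torch.zeros(8, 10, device=accelerator.device)} for _ in range(2)]
gathered = accelerator.gather_for_metrics(predictions)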

sgugger avatar Sep 29 '22 16:09 sgugger

Hey @sgugger, I completely appreciate where you are coming from; I am just genuinely curious about which parts of the library would be impacted by having an unequal number of samples in each process, if a context manager were added to handle the forward and backward passes of the model.

If this is limited, I think that it may be possible to implement this in a way that doesn't require too many changes; potentially just adding an extra flag to the existing batch samplers. I would be willing to investigate this myself, but I just wanted to judge whether this is something that you guys would be interested in before I devote any time to exploring this.

Honestly, whilst this issue was inspired by a use case that is quite difficult to summarize concisely, my interest here was more generally around whether uneven batch sizes could be handled in a way that would not require the user to choose between the regular gather and gather_for_metrics.

Chris-hughes10 avatar Sep 29 '22 16:09 Chris-hughes10

@Chris-hughes10 This mostly matters in the evaluation part of Accelerate (so gather and gather_for_metrics); for training we don't really care (plus the dataloaders very often have drop_last=True during training, so there is no problem there).

I'm open to starting to explore a different way to go, probably with a new flag on the accelerator. The first thing would be to have our batch sampler handle a non-fixed batch size (which would also be useful for training), then add this flag so that we do not cycle through the dataset but return different lengths on different processes, and finally add a wrapper around join.

Does that sound reasonable? If so, I can start a project summarizing the steps and we can share the work.

sgugger avatar Oct 03 '22 13:10 sgugger

@sgugger sounds good to me, let's see how far we get!

Chris-hughes10 avatar Oct 03 '22 13:10 Chris-hughes10

Arf, GitHub changed projects completely and I can't create an old-style one for Accelerate. Anyhow, here are a few steps I see:

  • [ ] Honor batch samplers with a dynamic batch size. For now, BatchSampler requires a fixed batch size and won't work if the user has something more complex. This is important for training and evaluation (see the sketch after this list).
  • [ ] Add an option to the Accelerator to not "pad" the datasets by looping back to the beginning, using the previous feature
  • [ ] Add a wrapper around join so that the previous step is usable for metric computation
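
For the first point, the kind of input that would need to be honored is roughly this (toy example only; VariableBatchSampler is hypothetical, not something that exists in accelerate or torch):

import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

# Hypothetical batch sampler whose batch size is not fixed.
class VariableBatchSampler(Sampler):
    def __init__(self, batch_sizes, num_samples):
        self.batches, start = [], 0
        for size in batch_sizes:
            self.batches.append(list(range(start, min(start + size, num_samples))))
            start += size

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

dataset = TensorDataset(torch.arange(10.0))
dataloader = DataLoader(dataset, batch_sampler=VariableBatchSampler([4, 3, 3], len(dataset)))
# dataloader = accelerator.prepare(dataloader)  # step 1 is about making this call work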

I should be able to free some time to work on the first point at the end of this week. Then you can iterate on this for the second point?

sgugger avatar Oct 03 '22 16:10 sgugger

That looks good to me; ping me when the first part is ready to go and I'll find some time to pick it up.

Chris-hughes10 avatar Oct 03 '22 18:10 Chris-hughes10