
Consistent batch size

Open kylebgorman opened this issue 3 years ago • 9 comments

I have a need (discussed off-thread) to make it so that every batch is the same size. This seems to me to require the following:

  1. datasets know the length of the longest source, features (if present), and target (a @property is appropriate here)
  2. there is a flag (call it --pad_to_max or something) which, when enabled, causes source, features, and target to be padded to the corresponding length from (1); this just needs to be passed as pad_len to the batches.PaddedTensor constructor
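The two steps above could be sketched roughly as follows. This is a hypothetical illustration, not the actual yoyodyne API: the class and property names are stand-ins, and samples are simplified to (source, target) pairs of symbol lists.

```python
# Hypothetical sketch of step (1): a dataset that exposes the length of
# its longest source and target as properties.
class Dataset:
    def __init__(self, samples):
        # Each sample is a (source, target) pair of symbol lists.
        self.samples = samples

    @property
    def max_source_length(self) -> int:
        return max(len(source) for source, _ in self.samples)

    @property
    def max_target_length(self) -> int:
        return max(len(target) for _, target in self.samples)


dataset = Dataset([(["a", "b"], ["x"]), (["c"], ["y", "z", "w"])])
```

Step (2) would then just forward these properties as pad_len when tensors are built.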

I will put this aside until #40 is completed, however, since this may interact in weird ways with that.

kylebgorman avatar Feb 08 '23 14:02 kylebgorman

We could consider following Hugging Face here? Though in our case we would not default to no padding, and I think padding=True is pretty terrible since all the other potential arguments are strings; we should add another string argument instead (or simply leave that behavior as the default).

https://huggingface.co/docs/transformers/pad_truncation

Adamits avatar Feb 09 '23 01:02 Adamits

What specifically in that doc do you want to follow? My thoughts:

  • I never want to truncate the source string.
  • I never want to truncate the target string during training; if I have one that is too long, I want to filter it out using simple UNIX tools instead.
  • I am fine with incremental decoding having an upper bound on target length that matches the one used in training; if that results in truncated predictions, so be it.

kylebgorman avatar Feb 09 '23 15:02 kylebgorman

Sorry, I thought I had linked to a specific table:

| Truncation    | Padding                          | Instruction                                                                                        |
| ------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
| no truncation | no padding                       | `tokenizer(batch_sentences)`                                                                        |
| no truncation | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or `tokenizer(batch_sentences, padding='longest')`       |
| no truncation | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')`                                                 |
| no truncation | padding to specific length       | `tokenizer(batch_sentences, padding='max_length', max_length=42)`                                   |

Basically, just that we can have a padding argument that accepts strings like longest, max_length, etc., with an optional max_length argument. Looking more closely and thinking about it, though, most of these options are not really useful for us; I think we just want something like the longest and max_length options. Anyway, I think max_batch_length and max_dataset_length or something similar would be clearer. So maybe ignore HF :).
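For concreteness, the difference between the two useful modes could be illustrated like this. Everything here is hypothetical (the pad symbol, the function, the mode names); it just mimics the semantics of the HF table above.

```python
# Illustrative comparison of the two padding modes under discussion.
PAD = 0


def pad_batch(batch, mode="longest", max_length=None):
    """Pads each sequence in the batch to a common length.

    mode="longest" pads to the longest sequence in the batch;
    mode="max_length" pads to the given max_length.
    """
    if mode == "longest":
        pad_len = max(len(seq) for seq in batch)
    elif mode == "max_length":
        pad_len = max_length
    else:
        raise ValueError(f"Unknown padding mode: {mode}")
    return [seq + [PAD] * (pad_len - len(seq)) for seq in batch]


batch = [[1, 2, 3], [4]]
```

With mode="longest" every batch can have a different width; with mode="max_length" every batch has the same width, which is the consistency this issue is after.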

Adamits avatar Feb 09 '23 16:02 Adamits

Yeah, I think we just want a flag that opts us into padding='max_length' (subject to --max_source_length and/or --max_target_length constraints) rather than the default padding='longest'.

Truncation makes more sense in BERT-ish land for other reasons, seems to me.

kylebgorman avatar Feb 09 '23 16:02 kylebgorman

Sounds good!

Yes, when you are training an LM on all the data you can possibly find, truncating is not such a problem :).

Adamits avatar Feb 09 '23 16:02 Adamits

Now that #40 is closed and such, I am wondering: what is the best way to enable this?

  • I think it should be a flag --pad_max which causes the padding to be as long as the longest source, features, and target length (respectively), with --no_pad_max the default.
  • Which module is responsible for computing the max for source, target, and features? Dataset? Dataconfig?
  • Is it okay for the warning or exception to be raised as in #51 in --pad_max mode? That is, we'd just let it happen naturally, on the first draw from the collator?

This information seems to need to live in the collator as a member. Then when it is asked to make a tensor, we'll pass pad_len=... to the appropriate max.
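Roughly something like this, with hypothetical names (PaddedTensor here is just a stand-in for batches.PaddedTensor, and the collator interface is simplified):

```python
# Hypothetical sketch: a collator that stores the dataset-wide maxima as
# members and passes them as pad_len when building padded tensors.
class PaddedTensor:
    def __init__(self, seqs, pad_idx=0, pad_len=None):
        # Falls back to the longest sequence in the batch if no pad_len given.
        if pad_len is None:
            pad_len = max(len(seq) for seq in seqs)
        self.data = [seq + [pad_idx] * (pad_len - len(seq)) for seq in seqs]


class Collator:
    def __init__(self, max_source_length=None, max_target_length=None):
        # When these are set (i.e., in --pad_max mode), every batch pads to them.
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __call__(self, sources, targets):
        return (
            PaddedTensor(sources, pad_len=self.max_source_length),
            PaddedTensor(targets, pad_len=self.max_target_length),
        )


collator = Collator(max_source_length=4, max_target_length=3)
src, tgt = collator([[1, 2]], [[5]])
```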

Any thoughts on this design @Adamits?

kylebgorman avatar Feb 11 '23 02:02 kylebgorman

  • I think it should be a flag --pad_max which causes the padding to be as long as the longest source, features, and target length (respectively), with --no_pad_max the default.

This sounds good. We should probably also add documentation for padding (the default pads to the batch max; this flag sets it to the dataset max).

  • Which module is responsible for computing the max for source, target, and features? Dataset? Dataconfig?

Dataset makes the most sense to me. I think the DataConfig just knows how to translate an input file into a sample, while the Dataset is the natural place to track properties of the actual dataset. This should be easy for now, since we store all the samples on the dataset. If we ever wanted to change this so that we don't store all samples in memory (not sure we need to care about that?), it would probably be a little trickier: we would need a preprocessing pass just to find the max sequence lengths.

  • Is it okay for the warning or exception to be raised as in #51 in --pad_max mode? That is, we'd just let it happen naturally, on the first draw from the collator?

This is a good question. I think we should let it raise. However, it is worth considering raising a different message if pad_max is used? Probably not necessary, since the obvious solution is just to increase max_source_length.

This information seems to need to live in the collator as a member. Then when it is asked to make a tensor, we'll pass pad_len=... to the appropriate max.

Agreed. So in this way, we can get the pad max on the dataset, pass it to the collator, and if that attribute is not None, we always pad to it when running the collator. I think this sounds good!

Any thoughts on this design @Adamits?

Adamits avatar Feb 22 '23 17:02 Adamits

Okay, so:

  • --pad_max is the flag.
  • Dataset computes the max(es).
  • This gets passed as the pad_max argument in batches.PaddedTensor calls via the collators.
  • I may add a TODO to tweak the error later.

kylebgorman avatar Feb 22 '23 20:02 kylebgorman

I now have an early draft of this here, in the max branch of my fork. Important simplifications while I get the basics right:

  • I assume that we are always padding to the max (for source and target); I can make it optional later.
  • I am ignoring feature models for now; that can be incorporated later.

I set the actual source_max_length to min(max(longest source string in train, longest source string in dev), --max_source_length), and similarly for target length. I then lightly modify the LSTM (it needs to be told the max source length) and the transformer (it needs to make the positional embedding as large as the max of the longest source and target string). Everything else is just plumbing.
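The length computation above amounts to something like the following sketch; the function name and the concrete lengths are made up for illustration:

```python
# Sketch of the effective max-length computation: the longest string seen
# in train or dev, capped by the corresponding --max_*_length flag; the
# transformer's positional embedding must cover both source and target.
def effective_max(train_max: int, dev_max: int, flag_max: int) -> int:
    return min(max(train_max, dev_max), flag_max)


source_max = effective_max(train_max=28, dev_max=31, flag_max=128)
target_max = effective_max(train_max=25, dev_max=24, flag_max=128)
positional_embedding_size = max(source_max, target_max)
```

So the flag acts purely as a cap: if the data is shorter than the flag, padding stops at the data's own maximum.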

If you have some time I'd appreciate your input @Adamits...I could generate an early PR for review if that'd help.

kylebgorman avatar May 29 '23 22:05 kylebgorman