
perf: collator with padding for tpu

Open · tianjianjiang opened this issue on Sep 06 '21 · 9 comments

Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu

I also see that there's a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16. I'm not really sure:

  • what we want here for both TPU and GPU (the latter can get by with longest-per-batch padding); see the sketch after this list;
  • whether truncation is also what we want.
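
For concreteness, here is roughly what I have in mind (just a sketch; the checkpoint name and the fixed max_length are placeholders, not what we actually use):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# On TPU, pad every batch to one fixed length so XLA does not recompile
# for every new input shape; on GPU, padding to the longest sample in the
# batch is usually enough.
tpu_collator = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=1024)
gpu_collator = DataCollatorWithPadding(tokenizer, padding="longest")
```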

tianjianjiang avatar Sep 06 '21 14:09 tianjianjiang

Indeed, on TPU the length of the samples is important, you are right.

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? :slightly_smiling_face:

SaulLu avatar Sep 06 '21 14:09 SaulLu

As a preliminary question, I was under the impression that the method add_metadata_and_chunk_examples in metadata_utils.py already returns samples of the same size. If this is the case, it is as if all examples had been padded/truncated to the same size. Am I missing something here? 🙂

I was under the same impression, but then I realized that perhaps we are not truncating them at all, which is why my Colab notebook always warns me about the length. Admittedly it is probably not really about padding but truncation. If that's the case, then we may also have to modify the tokenizer.

For the padding itself, I guess the only useful situation is for the control group without metadata.

tianjianjiang avatar Sep 06 '21 14:09 tianjianjiang

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

SaulLu avatar Sep 06 '21 15:09 SaulLu

therefore my Colab notebook always warns me about the length.

Please correct me if you disagree, but I think the warning comes from line 69 of metadata_utils. In that line it is perfectly intentional that the sequence is longer than the maximum length, since it is then divided into sequences of the right length in the loop at line 84.

Hi, I'm gonna be busy for at least a couple of days, so forgive me if I am not responsive or thorough. As far as I can tell, the warnings appeared before those lines, even without metadata. Admittedly they are probably not that critical. I will check carefully later and get back to you.

tianjianjiang avatar Sep 07 '21 05:09 tianjianjiang

After truncation, the warnings are gone, and the training and the evaluation are twice as fast. More importantly, it seems to make the control group fairer. Previously the control group without metadata could have longer texts than the experiment group with metadata, and the former's perplexities were much lower than the latter's. Now the trend is reversed, as expected.

tianjianjiang avatar Sep 08 '21 23:09 tianjianjiang

I think I am missing something in these changes. I'll try to explain how I saw things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that will allow us to identify problems and new topics, and you're raising some interesting issues here!

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples:

  1. We (possibly) add local metadata to the plain text (this increases the size of the example)
  2. We then tokenize this example with metadata. Since the examples in our dataset can be very, very long, this can result in a number of tokens for our sequences >> model.max_seq_len [the warning is raised here]
  3. We cut this example into pieces so that we only have examples of size < model.max_seq_len

Finally, with this operation, one example is transformed into several examples, each of size at most model.max_seq_len (in fact, all except the last one have size exactly model.max_seq_len). Which brings me to a side suggestion: shouldn't we remove that last example, which has a different size?
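
To make sure we are talking about the same operation, here is a rough sketch of that chunking step as I understand it (the function name and the drop_last option are illustrative, not the actual code in metadata_utils.py):

```python
def chunk_token_ids(token_ids, max_seq_len, drop_last=False):
    """Split one (possibly very long) tokenized example into pieces of at most max_seq_len.

    Every chunk except possibly the last one has length exactly max_seq_len.
    With drop_last=True the trailing, shorter chunk is discarded (the side
    suggestion above).
    """
    chunks = [token_ids[i:i + max_seq_len] for i in range(0, len(token_ids), max_seq_len)]
    if drop_last and chunks and len(chunks[-1]) < max_seq_len:
        chunks = chunks[:-1]
    return chunks

# chunk_token_ids(list(range(10)), 4)                 -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# chunk_token_ids(list(range(10)), 4, drop_last=True) -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```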

I would therefore have the impression that the change you propose will reduce the number of examples in our training dataset. If that's the case, is that really what we want?

Then, regarding the max_seq_len for the "without_metadata" code: similarly to the script with metadata, what was done here was to tokenize the examples and then, from these examples, create several examples of the same size (equal to block_size this time).
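
In other words, something along the lines of the usual group_texts pattern from the causal-LM examples (a sketch, not necessarily our exact code):

```python
def group_texts(examples, block_size):
    """Concatenate all tokenized texts, then split them into blocks of exactly block_size.

    This mirrors the usual causal-LM preprocessing: the leftover tail that is
    shorter than block_size is dropped.
    """
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```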

I really agree that it would be much better to have a comparable metric! I totally share your point of view that the perplexities are not comparable between a dataset with metadata and one without metadata (if the former does not take the tokens corresponding to the metadata into account in the calculation of the loss). Unfortunately, with a block_size chosen like this, I'm afraid it's still not comparable, as the number of tokens associated with metadata changes for each example.

Also, as I mentioned to @timoschick yesterday, I think it would be great to also try to take local metadata into account in the loss, with a special token at the beginning of the sequence saying whether or not you want that metadata to be included. This new framework would allow us to compare perplexities calculated on sequences of the same length for local metadata. However, this would only cover that particular case and not the case where the metadata is not taken into account in the loss. Moreover, for global metadata, I don't see a framework that could have a fixed length.

So, for now the only solution I can think of is to take, for the baseline, examples that correspond exactly to each example with metadata added but without its metadata (so that we can compare perplexities calculated on the same length of sequences).
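
(To be explicit about what I mean by "not take into account in the calculation of the loss": the usual trick of masking the labels at the metadata positions, sketched below with a hypothetical per-token metadata_mask.)

```python
import torch

def mask_metadata_in_labels(input_ids: torch.Tensor, metadata_mask: torch.Tensor) -> torch.Tensor:
    """Build causal-LM labels in which metadata tokens do not contribute to the loss.

    metadata_mask is 1 for metadata tokens and 0 for plain-text tokens; label
    positions set to -100 are ignored by the cross-entropy loss in HF models.
    """
    labels = input_ids.clone()
    labels[metadata_mask.bool()] = -100
    return labels
```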

I look forward to reading your thoughts on this! :hugs:

SaulLu avatar Sep 09 '21 08:09 SaulLu

I think I am missing something in these changes. I'll try to explain how I saw things, and if I'm wrong somewhere I'd be very happy to hear from you! In any case, we're writing research code, so I think we should not hesitate to experiment in small iterations that will allow us to identify problems and new topics, and you're raising some interesting issues here!

I was also debating with myself whether to do it this time or make another PR. So for now I changed this one to a draft. Hopefully we can have it done once and for all (not really all but you know what I mean...).

(Pardon me that I'm gonna skip quoting the whole thing this time.)

If I understand the code and your point correctly, I believe I share a pretty similar concern with you. The truncation this PR does so far is merely a WIP-style checkpoint (which I should have clarified beforehand, my apologies). Admittedly, when there is already a series of procedures that cuts pieces and groups blocks, truncating before it feels totally redundant and even disruptive.

The main reason why I want to try truncation is that, when splitting concatenated examples into blocks/chunks, I believe the blocks that are not the heads of the original examples may behave differently for an autoregressive LM, and then the perplexity may be less intuitive (to me).

For example, assuming a super long original example says "wubba lubba dub dub!", concat-and-split would produce two blocks, "wubba lubba dub" and "dub!", while truncating in advance would produce just one block, "wubba lubba dub", and my guess is that an autoregressive LM may prefer the second case.
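
In toy code, with a hypothetical block size of 3 word-level tokens (a real tokenizer would of course split differently):

```python
tokens = ["wubba", "lubba", "dub", "dub!"]
max_len = 3  # toy block size

# concat-and-split keeps every piece:
split = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
# -> [["wubba", "lubba", "dub"], ["dub!"]]

# truncating in advance keeps only the head of each original example:
truncated = [tokens[:max_len]]
# -> [["wubba", "lubba", "dub"]]
```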

In a way it is exactly like what you also mentioned, "remove that last example which has a different size", but I may go even further and drop all blocks (regardless of their sizes) except the first, because I don't know whether (or to what kind of LM) the second through last blocks make sense. This whole speculation is probably what bothers/interests me the most.

On the bright side, perhaps we can keep both configurations as different control/experiment groups? (But I can't shake the feeling that somebody must have done that already...)

So, truncation before concat-and-split may virtually render the latter a no-op, which is actually the trial-and-error I want to have. In this sense, you are absolutely right about the decreased number of examples, and yet I am not confident enough to recommend this trade-off.

As for comparability in general, I am thinking that maybe we can have another kind of control group that prepends noise of the same length as the corresponding metadata.
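
Roughly something like this (hypothetical names, and assuming we know how many tokens the metadata occupied for each example):

```python
import random

def prepend_noise_like_metadata(text_token_ids, metadata_len, vocab_size):
    """Control-group variant: prepend random token ids of the same length as the
    metadata that the experiment group would have seen for this example."""
    noise = [random.randrange(vocab_size) for _ in range(metadata_len)]
    return noise + text_token_ids
```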

🍀

tianjianjiang avatar Sep 09 '21 15:09 tianjianjiang

From my point of view, this is what has been happening so far, for an example that was handled by add_metadata_and_chunk_examples: [...]

Yes, that's exactly what happens with the current code.

Shouldn't we remove that last example which has a different size?

I don't have a strong opinion here. Does any of you know how this is typically done? What would be the downside of having an example with a different size (other than being less memory-efficient because we throw away all computations done on the padding tokens)?

In a way it is exactly like what you also mentioned, "remove that last example which has a different size", but I may go even further and drop all blocks (regardless of their sizes) except the first.

Whether this is a reasonable thing to do strongly depends on what our examples look like (i.e., how many tokens a single example contains). In your toy example (the "wubba lubba dub dub!" one), I agree that it's probably best not to train on the second (very short) block at all. However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

timoschick avatar Sep 09 '21 16:09 timoschick

However, we will probably have much longer examples in practice - let's say an entire paragraph from Wikipedia. Throwing away everything but the first 512 (or 1,024) tokens from this paragraph would drastically reduce the number of examples in our training dataset; we'd be throwing away lots of training examples that the model could learn from. Or did I misunderstand something about your proposal?

I think I understand the concern here. My toy example may be too simple to convey my feeling about the second through last sub-examples, which may or may not be useful to an autoregressive LM. When we have long paragraphs from mC4, for instance, I imagine some of those chunked sub-examples may be sequences that start in the middle of something. Honestly, I don't really know whether they are just harmless (if not useful) abstract n-grams to the LM or broken (left-side) context that could have some negative impact.

(I don't have much experience in this situation because my tasks usually presume that we do sentence segmentation first.)

tianjianjiang avatar Sep 09 '21 16:09 tianjianjiang