
`logit_scale` and resume

Open mitchellnw opened this issue 2 years ago • 1 comments

I'm noticing that logit_scale will steeply change direction on resume. Probably not a huge deal since it stays fairly close to 100, but worth tracking this issue in case others encounter it and it's indicative of some other problem with how we clip.

[Screenshot: logit_scale over training steps, Dec 19 '22]

mitchellnw avatar Dec 19 '22 22:12 mitchellnw
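For context on the clipping being discussed: open_clip keeps `logit_scale` as a learnable log-space parameter and clamps it after each optimizer step so that `exp(logit_scale)` never exceeds 100. A minimal sketch of that pattern (hypothetical setup and learning rate, not the actual training loop):

```python
import math
import torch

# logit_scale is stored in log space, initialized as in CLIP to log(1/0.07).
logit_scale = torch.nn.Parameter(torch.ones([]) * math.log(1 / 0.07))
optimizer = torch.optim.SGD([logit_scale], lr=5.0)  # lr chosen to force a large step

# Simulate one step whose gradient pushes the scale well above the cap.
loss = -logit_scale  # gradient of -1 drives logit_scale upward
loss.backward()
optimizer.step()

# Post-step clamp: keeps exp(logit_scale) in [1, 100].
with torch.no_grad():
    logit_scale.clamp_(0, math.log(100))

print(float(logit_scale.exp()))  # capped at 100
```

The clamp itself is stateless, so a resume should not change its behavior; any jump on resume would have to come from the optimizer state or the data stream, not the clipping.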

@mitchellnw that's interesting, I haven't observed that before in previous runs. I went back to check across some old resumes, and even in the overlap (where there were logs before a mid-epoch crash), the resumed part that overlaps was pretty much the same, within +/- 0.2-0.3.

I wonder if there is something specific to G (large model, closer to the edge of stability?). I have seen sudden drops from 100 to the high 80s without any sort of resume, though, followed by a recovery back to 100.

rwightman avatar Dec 20 '22 06:12 rwightman

Closing, because the hypothesis is that this relates to a filesystem issue which should not affect most users.

mitchellnw avatar Dec 23 '22 18:12 mitchellnw

Here's one hypothesis for what's going on. Look at the graphs for logit_scale and samples/s towards the end of training: the dips in logit_scale occur towards the end of the SCI. Perhaps this is when things are "most random", because at the beginning of the SCI it's more likely that a batch consists of many images from the same shard, especially with a batch size of 160k.

[Screenshot: logit_scale and samples/s towards the end of training, Dec 25 '22]

Therefore, maybe on resume there is less randomness, leading to these jumps up in logit_scale (the model wants to be more confident because it is more overfit: it has seen batches like this before).

mitchellnw avatar Dec 25 '22 23:12 mitchellnw

@mitchellnw coming back to this one, I don't feel the explanation makes sense; logit_scale dips should have no correlation with the end of the SCI in terms of dataset randomness.

Each dataloader worker process across each train process (1 per GPU) samples a shard with replacement, reads the samples from that shard, and shuffles them within a smaller shuffle buffer. There should be no noteworthy difference in data distribution that relates to the position in the checkpoint interval, even if the samples across the shards weren't shuffled (I believe @rom1504 said they were at some point?).
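The sampling scheme described above can be sketched roughly as follows (assumed helper names, not the actual webdataset API): each worker draws shards with replacement and routes samples through a fixed-size shuffle buffer, so the output distribution is stationary and does not depend on where in the checkpoint interval the run is.

```python
import random

def sample_stream(shards, rng):
    """Endlessly pick a shard with replacement and stream its samples in order."""
    while True:
        shard = rng.choice(shards)   # sampled with replacement
        for sample in shard:
            yield sample

def shuffle_buffer(stream, bufsize, rng):
    """Shuffle within a small fixed-size buffer, as a webdataset-style pipeline does."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) >= bufsize:
            # emit a random element once the buffer is full
            yield buf.pop(rng.randrange(len(buf)))

rng = random.Random(0)
# Toy data: 3 shards of 4 samples each, named "s<shard>_<index>".
shards = [[f"s{i}_{j}" for j in range(4)] for i in range(3)]
stream = shuffle_buffer(sample_stream(shards, rng), bufsize=8, rng=rng)
batch = [next(stream) for _ in range(6)]
```

Because `sample_stream` has no notion of epochs or checkpoints, the statistics of `batch` look the same at any point in training, which is why a correlation with the checkpoint interval would be surprising.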

rwightman avatar Jan 02 '23 19:01 rwightman