ssl-legacysurvey icon indicating copy to clipboard operation
ssl-legacysurvey copied to clipboard

Bug in RGB augmentation ?

Open EiffL opened this issue 1 year ago • 2 comments

Trying to understand exactly what transformation to apply to some new raw data I want to get embeddings for, I noticed what I think might be a little bug here: https://github.com/georgestein/ssl-legacysurvey/blob/277a492a8fc7d3fa6e08945dc5fb88ed6991671b/ssl_legacysurvey/data_loaders/decals_augmentations.py#L406

This does not seem to match the d2_rgb formula which has for this line:

rgb[:,:,plane] = (img * scale + m) * fI / I

This translates into the following effect. Here is how a galaxy looks like in an imshow if I apply ToRGB to the raw images: image And here is what it looks like if I instead use the dr2_rgb function: image

It is less saturated and more details are visible.

I don't think it is a huge deal, but I would bet that the quality of the embeddings would actually be even better with this range compression formula.

If indeed this was unintended, I'm happy to make a PR/new branch.

EiffL avatar Sep 28 '23 01:09 EiffL

Nice catch!

Looks like you are correct and the scaling used for training/inference indeed had the bug you point out. At least it was consistent over both. But I agree the default scaling would very likely improve things further and should be used in any future versions.

In the visualizations the scaling is correct as they use ssl_legacysurvey/utils/to_rgb.py and not the bugged version in the decals ssl augmentations, but it seems when I vectorized that for loop over for img,band in zip(imgs, bands) I moved a parentheses where it shouldn't be...

The pretrained models we release should keep this scaling, but anything new should be trained with the corrected one. Please make a new branch for the fix - I wouldn't want the change to be merged into master and then accidentally used in inference when loading in the previously trained models.

georgestein avatar Sep 28 '23 02:09 georgestein

Yeah exactly, so... maybe best to leave as-is because at least it's consistent. I find that the similarity searches are pretty 'similar' whether I use the fixed scaling or not, and that's probably thanks to the color jitter augmentations used during training.

Maybe one way to fix it is that if someone retrains these models, they could introduce a new "ToRGBv2" augmentation that would fix the problem.

Anyways, I mostly cared to know how to properly normalize the data :-)

EiffL avatar Sep 28 '23 16:09 EiffL