Downloaded tif files are black

Open blumenstiel opened this issue 1 year ago • 4 comments

I downloaded some data and noticed that some S2 data is completely black, e.g., grid cell 207D_1378R or 438U_1009R. The S1 data looks fine.

I used the filter_download function provided in this repo and tested both with and without by_row. I also tried Image.open(BytesIO(table[col][0].as_py())).show(), with the same result.

The tif files do not include a FillValue. I assume 0 is used for NaN values?

Is it possible that some data got corrupted during the download or upload to HF?

blumenstiel avatar Apr 26 '24 12:04 blumenstiel

Hi @blumenstiel - thanks for bringing this up! I had a look too and it does seem like these two cells are indeed corrupted.

We made no changes to the original values, so like in the original Sentinel-2 data, 0 should represent no data (as far as I'm aware).

It is somewhat unlikely that the corruption occurred during the upload, so we will investigate soon. If needed we can update the corresponding parquet file.

Are there more files that are completely black that you found?

mikonvergence avatar Apr 30 '24 13:04 mikonvergence

Hi @mikonvergence, thanks for looking into it!

I checked another 100 random samples and got 14 corrupted files:

,grid_cell
0,171D_798L
1,160D_805L
2,143D_811L
3,142D_810L
4,142D_803L
5,138D_800L
6,133D_803L
7,128D_793L
8,117D_811L
9,113D_786L
10,110D_813L
11,107D_796L
12,94D_810L
13,451U_259L

So I assume that this potentially affects 10-20% of the grid cells. I did not manually check the samples, but based on my code, each of these grid cells should have only NaN values in either S1 or S2.

Maybe add a quick check after downloading/before uploading to your processing scripts?
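
A check along these lines could be as small as the sketch below. It assumes the tile has already been decoded into a NumPy array (for example from the tif bytes) and that 0 is the no-data value, as discussed above. The function names `nodata_fraction` and `is_all_nodata` are hypothetical and not part of this repo.

```python
import numpy as np


def nodata_fraction(pixels, nodata_value=0):
    """Return the share of pixels equal to nodata_value, between 0.0 and 1.0."""
    arr = np.asarray(pixels)
    if arr.size == 0:
        raise ValueError("cannot compute a no-data fraction for an empty array")
    return float(np.count_nonzero(arr == nodata_value)) / arr.size


def is_all_nodata(pixels, nodata_value=0):
    """True if every pixel equals nodata_value (the tile renders fully black)."""
    return nodata_fraction(pixels, nodata_value) == 1.0


# Synthetic stand-ins for decoded tiles (assumption: 0 marks no-data)
black_tile = np.zeros((4, 4), dtype=np.uint16)
partial_tile = black_tile.copy()
partial_tile[0, :2] = 1200  # two valid pixels out of sixteen

print(is_all_nodata(black_tile), nodata_fraction(partial_tile))
```

Running this on each tile after download (or before upload) would flag fully black files without opening them by hand.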

blumenstiel avatar Apr 30 '24 14:04 blumenstiel

Hi, we're looking into this! Thanks for bringing it to our attention.

After some digging: a small percentage of S2 tiles (1.3%) have 100% no-data (==0). I guess you got very unlucky, or something about your search made them more likely? Regardless, I'm not sure why this happened in the first place or why it got past our checks. It seems that all the IDs you list here have nodata==1.0 in the metadata (except the last grid tile, which I manually verified: it has an image over the sea, albeit a dark one). So, for now, I recommend explicitly filtering out tiles with a 100% nodata percentage (the value is a ratio between 0 and 1, as sometimes we get images that are partially nodata).
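
As a rough sketch of that filter, assuming the metadata has been loaded into a pandas DataFrame and that the ratio lives in a column named `nodata` (the column name, the helper `drop_fully_nodata`, and the toy rows below are assumptions for illustration only):

```python
import pandas as pd


def drop_fully_nodata(metadata, nodata_col="nodata", max_nodata=1.0):
    """Keep only rows whose no-data ratio is strictly below max_nodata."""
    keep = metadata[nodata_col] < max_nodata
    return metadata[keep].reset_index(drop=True)


# Toy stand-in for the metadata table; names and values are made up
meta = pd.DataFrame(
    {
        "grid_cell": ["cell_a", "cell_b", "cell_c"],
        "nodata": [1.0, 0.0, 0.35],
    }
)

usable = drop_fully_nodata(meta)
print(usable)
```

Lowering `max_nodata` (for example to 0.5) would also drop tiles that are mostly, but not entirely, no-data.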

As I say, thanks for bringing this to our attention, we will look into correcting/removing these!!

aliFrancis avatar May 02 '24 15:05 aliFrancis

Thank you @aliFrancis! I forgot to look at the no-data column, this explains a lot.

blumenstiel avatar May 02 '24 15:05 blumenstiel