digitalearthau
Duplicate datasets for multiple products on the NCI
During recent ingest runs (Scenes -> Albers tiles), upwards of 50,000 duplicate tiles were created for the ls8_nbar_albers product in 2018.
Anyone querying this product may get duplicate data back!
I'm currently investigating:
- How to safely remove the duplicates.
- Whether other products are also affected.
- What caused the bug in the first place.
There are quite a few products with large numbers of duplicate datasets (2018 only):
| product | should exist | duplicates | total |
|---|---|---|---|
| s2a_level1c_granule | 42568 | 1298 | 43866 |
| s2b_level1c_granule | 45750 | 1857 | 47607 |
| ls7_fc_albers | 45491 | 13539 | 59030 |
| ls8_fc_albers | 78389 | 4432 | 82821 |
| ls8_nbar_albers | 73076 | 64803 | 137879 |
| ls7_nbar_albers | 42582 | 0 | 42582 |
| ls8_nbart_albers | 78389 | 48049 | 126438 |
| ls7_nbart_albers | 45491 | 0 | 45491 |
| ls8_pq_albers | 77270 | 23711 | 100981 |
| ls7_pq_albers | 43648 | 0 | 43648 |
We've also made some progress in finding the cause. Some of the AWS Lambda functions used to submit jobs on raijin were timing out at 2 minutes and being retried a couple of minutes later, resulting in two separate executions of the same job at the same time.
As a temporary measure we've upped the timeout to 5 minutes, and we're looking into more rigorous methods to prevent this from happening again.
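One more rigorous option would be an idempotency ledger: record every submission under a unique key, so a retried invocation can see that the job has already gone out. A minimal Postgres sketch; the `job_submissions` table and the `job_key` format are hypothetical, not part of our current setup:

```sql
-- Hypothetical idempotency ledger: one row per logical job. A retried
-- invocation inserts the same key, hits the primary-key conflict, gets no
-- row back from RETURNING, and knows to skip the submission.
CREATE TABLE IF NOT EXISTS job_submissions (
    job_key      text PRIMARY KEY,            -- e.g. product + ingest window
    submitted_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO job_submissions (job_key)
VALUES ('ls8_nbar_albers:2018-06')
ON CONFLICT (job_key) DO NOTHING
RETURNING job_key;                            -- zero rows => already submitted
```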
The cause of the s2a and s2b duplicates is unrelated and will need to be addressed separately.
SQL to query for a single product:
```sql
SELECT
    count(*) FILTER (WHERE row_number = 1) AS should_exist,
    count(*) FILTER (WHERE row_number > 1) AS num_dupes,
    count(*) AS total
FROM (SELECT row_number() OVER (PARTITION BY lat, lon, time
                                ORDER BY metadata_doc ->> 'creation_dt') AS row_number,
             lat,
             lon,
             time,
             metadata_doc ->> 'creation_dt' AS creation_dt,
             id
      FROM dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time) t;
```
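For the first investigation point above (safely removing the duplicates), the same window query can be turned around to list the ids of the redundant copies. A sketch, assuming we keep the copy with the earliest creation_dt; the listed ids could then be archived (e.g. via `datacube dataset archive`) rather than deleted from the index by hand:

```sql
-- List the ids of the redundant copies: everything after the first dataset
-- per (lat, lon, time) cell, keeping the earliest creation_dt.
SELECT id
FROM (SELECT row_number() OVER (PARTITION BY lat, lon, time
                                ORDER BY metadata_doc ->> 'creation_dt') AS row_number,
             id
      FROM dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time) t
WHERE row_number > 1;
```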
Ingest is not re-entrant. Running ingest a second time on the same product while the first run is still in progress will generate duplicate datasets that differ only by uuid (computed non-deterministically) and creation time. For each product, ingest will:
- Figure out what to do for a given product
- Generate ingested files (with random uuids)
- Add to index
There are no locks of any kind, and the uuid is generated at random. Deterministic uuid computation (e.g. deriving it from product, tile and time) would prevent duplicate index entries, but would not prevent the wasted compute. Out-of-band measures are needed to ensure that ingest is not called concurrently; one possible guard is sketched below.
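As one possible out-of-band guard (a sketch of an option, not something ingest does today; the `ingest:` key prefix is made up), a session-level Postgres advisory lock keyed on the product would make a second concurrent ingest fail fast instead of silently double-writing:

```sql
-- Hypothetical concurrency guard: take a session-level advisory lock keyed
-- on the product before ingesting. pg_try_advisory_lock() returns false
-- immediately if another session already holds the lock, so a second
-- concurrent ingest of the same product can abort instead of duplicating work.
SELECT pg_try_advisory_lock(hashtext('ingest:ls8_nbar_albers')) AS got_lock;

-- ... run the ingest only if got_lock is true ...

SELECT pg_advisory_unlock(hashtext('ingest:ls8_nbar_albers'));
```

Note that a session-level advisory lock is only held while the database connection stays open, so the ingest process would need to keep that connection alive for the duration of the run.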