
Duplicate datasets for multiple products on the NCI

omad opened this issue 7 years ago · 2 comments

During recent ingest runs (Scenes -> Albers tiles), roughly 50,000 duplicate tiles were created for the ls8_nbar_albers product in 2018.

This means anyone querying this product may get duplicate data back!

I'm currently investigating:

  1. How to safely remove the duplicates.
  2. Whether other products are also affected.
  3. What caused the bug in the first place.

omad avatar Sep 28 '18 01:09 omad

There are quite a few products with large numbers of duplicate datasets (2018 only):

| prod_name           | actual | num_dupes | total  |
|---------------------|-------:|----------:|-------:|
| s2a_level1c_granule | 42568  | 1298      | 43866  |
| s2b_level1c_granule | 45750  | 1857      | 47607  |
| ls7_fc_albers       | 45491  | 13539     | 59030  |
| ls8_fc_albers       | 78389  | 4432      | 82821  |
| ls8_nbar_albers     | 73076  | 64803     | 137879 |
| ls7_nbar_albers     | 42582  | 0         | 42582  |
| ls8_nbart_albers    | 78389  | 48049     | 126438 |
| ls7_nbart_albers    | 45491  | 0         | 45491  |
| ls8_pq_albers       | 77270  | 23711     | 100981 |
| ls7_pq_albers       | 43648  | 0         | 43648  |

We've also made some progress in finding the cause. Some of the AWS Lambda functions used to submit jobs on raijin were timing out at 2 minutes and being retried a couple of minutes later, resulting in two separate executions of the same job at the same time.

As a temporary measure we've increased the timeout to 5 minutes, and we are looking into more rigorous methods to prevent this from happening again.
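For reference, a Lambda timeout can be raised with the AWS CLI like so (the function name below is a placeholder, not the actual function used here):

```shell
# Raise the Lambda timeout from the 2-minute default we were hitting
# to 5 minutes (300 seconds). Substitute the real job-submission
# function name for the placeholder.
aws lambda update-function-configuration \
    --function-name my-job-submitter \
    --timeout 300
```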

The cause of the s2a and s2b duplicates is unrelated and will need to be addressed separately.

SQL to query a single product:

```sql
SELECT
    COUNT(*) FILTER (WHERE row_number = 1) AS should_exist,
    COUNT(*) FILTER (WHERE row_number > 1) AS num_dupes,
    COUNT(*)                               AS total
FROM (SELECT row_number() OVER (PARTITION BY lat, lon, time
                                ORDER BY metadata_doc ->> 'creation_dt') AS row_number,
             lat,
             lon,
             time,
             metadata_doc ->> 'creation_dt' AS creation_dt,
             id
      FROM dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time
     ) t;
```
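The same row_number logic can be sketched in Python, for anyone who finds the window function hard to read — group by tile key, sort by creation time, and treat everything after the first dataset in each group as a duplicate (the dict keys here are illustrative, not the actual index schema):

```python
from collections import defaultdict

def count_duplicates(datasets):
    """Mirror the SQL query: partition by (lat, lon, time), order each
    partition by creation_dt, keep the first dataset and count the rest
    as duplicates."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[(ds["lat"], ds["lon"], ds["time"])].append(ds)

    should_exist = num_dupes = 0
    for group in groups.values():
        group.sort(key=lambda d: d["creation_dt"])
        should_exist += 1              # first dataset in the partition
        num_dupes += len(group) - 1    # everything else is a duplicate
    return should_exist, num_dupes, should_exist + num_dupes
```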

omad avatar Sep 28 '18 08:09 omad

Ingest is not re-entrant. Running ingest a second time on the same product while the first run is still in progress will generate duplicate datasets that differ only by uuid (computed non-deterministically) and creation time.

  1. Figure out what to do for a given product
  2. Generate ingested files (with random uuids)
  3. Add to index

There are no locks of any kind, and the uuid is generated at random. Deterministic uuid computation would prevent duplicates in the index, but would not prevent wasted compute. Out-of-band measures are needed to ensure that ingest is not invoked concurrently.
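A minimal sketch of deterministic uuid computation, using the standard library's name-based `uuid.uuid5`: derive the dataset id from its identifying fields so a re-run reproduces the same id. The namespace URL and key format here are hypothetical, not what datacube actually uses:

```python
import uuid

# Hypothetical fixed namespace; a real deployment would agree on one
# project-wide value so all ingest runs derive the same ids.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/odc-ingest")

def deterministic_dataset_id(product, lat, lon, time):
    """Derive the dataset uuid from its identifying fields instead of
    uuid4(), so re-running ingest over the same tile yields the same id
    and the index can reject it as a duplicate."""
    key = f"{product}/{lat}/{lon}/{time}"
    return uuid.uuid5(NAMESPACE, key)
```

With ids computed this way a second concurrent run still burns compute regenerating the files, but indexing the result becomes idempotent.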

Kirill888 avatar Oct 02 '18 01:10 Kirill888