digitalearthau
Duplicate datasets for multiple products on the NCI
During recent ingest runs (Scenes -> Albers tiles), upwards of 50,000 duplicate tiles were created for the ls8_nbar_albers product in 2018.
Anyone querying this product may get duplicate data back!
I'm currently investigating:
- How to safely remove the duplicates.
- Whether other products are also affected.
- What caused the bug in the first place.
There are quite a few products with large numbers of duplicate datasets (2018 only):
| product | should exist | duplicates | total |
|---|---|---|---|
| s2a_level1c_granule | 42568 | 1298 | 43866 |
| s2b_level1c_granule | 45750 | 1857 | 47607 |
| ls7_fc_albers | 45491 | 13539 | 59030 |
| ls8_fc_albers | 78389 | 4432 | 82821 |
| ls8_nbar_albers | 73076 | 64803 | 137879 |
| ls7_nbar_albers | 42582 | 0 | 42582 |
| ls8_nbart_albers | 78389 | 48049 | 126438 |
| ls7_nbart_albers | 45491 | 0 | 45491 |
| ls8_pq_albers | 77270 | 23711 | 100981 |
| ls7_pq_albers | 43648 | 0 | 43648 |
We've also made some progress in finding the cause. Some of the AWS Lambda functions used to submit jobs on raijin were timing out at 2 minutes and being retried a couple of minutes later, resulting in two separate executions of the same job at the same time.
As a temporary measure we've upped the timeout to 5 minutes, and we're looking into more rigorous methods to prevent this from happening again.
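One more rigorous option would be an idempotency ledger: record every submission under a unique key, so a retried invocation can see that the job has already gone out. A minimal Postgres sketch; the `job_submissions` table and the `job_key` format are hypothetical, not part of our current setup:

```sql
-- Hypothetical idempotency ledger: one row per logical job. A retried
-- invocation inserts the same key, hits the primary-key conflict, gets no
-- row back from RETURNING, and knows to skip the submission.
CREATE TABLE IF NOT EXISTS job_submissions (
    job_key      text PRIMARY KEY,            -- e.g. product + ingest window
    submitted_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO job_submissions (job_key)
VALUES ('ls8_nbar_albers:2018-06')
ON CONFLICT (job_key) DO NOTHING
RETURNING job_key;                            -- zero rows => already submitted
```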
The cause of the s2a and s2b duplicates is unrelated and will need to be addressed separately.
SQL to query for a single product:
```sql
SELECT
    count(*) FILTER (WHERE row_number = 1) AS should_exist,
    count(*) FILTER (WHERE row_number > 1) AS num_dupes,
    count(*) AS total
FROM (SELECT row_number() OVER (PARTITION BY lat, lon, time
                                ORDER BY metadata_doc ->> 'creation_dt') AS row_number,
             lat,
             lon,
             time,
             metadata_doc ->> 'creation_dt' AS creation_dt,
             id
      FROM dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time) t;
```
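For the first investigation point above (safely removing the duplicates), the same window query can be turned around to list the ids of the redundant copies. A sketch, assuming we keep the copy with the earliest creation_dt; the listed ids could then be archived (e.g. via `datacube dataset archive`) rather than deleted from the index by hand:

```sql
-- List the ids of the redundant copies: everything after the first dataset
-- per (lat, lon, time) cell, keeping the earliest creation_dt.
SELECT id
FROM (SELECT row_number() OVER (PARTITION BY lat, lon, time
                                ORDER BY metadata_doc ->> 'creation_dt') AS row_number,
             id
      FROM dv_ls8_pq_albers_dataset
      WHERE tstzrange('2018-01-01', '2018-12-31') && time) t
WHERE row_number > 1;
```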
Ingest is not re-entrant. Running ingest a second time on the same product while the first run is still in progress will generate duplicate datasets that differ only by uuid (computed non-deterministically) and creation time. For each product, ingest will:
- Figure out what to do for a given product
- Generate ingested files (with random uuids)
- Add to index
There are no locks of any kind, and the uuid is generated at random. Deterministic uuid computation (e.g. deriving it from product, tile and time) would prevent duplicate index entries, but would not prevent the wasted compute. Out-of-band measures are needed to ensure that ingest is not called concurrently; one possible guard is sketched below.
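As one possible out-of-band guard (a sketch of an option, not something ingest does today; the `ingest:` key prefix is made up), a session-level Postgres advisory lock keyed on the product would make a second concurrent ingest fail fast instead of silently double-writing:

```sql
-- Hypothetical concurrency guard: take a session-level advisory lock keyed
-- on the product before ingesting. pg_try_advisory_lock() returns false
-- immediately if another session already holds the lock, so a second
-- concurrent ingest of the same product can abort instead of duplicating work.
SELECT pg_try_advisory_lock(hashtext('ingest:ls8_nbar_albers')) AS got_lock;

-- ... run the ingest only if got_lock is true ...

SELECT pg_advisory_unlock(hashtext('ingest:ls8_nbar_albers'));
```

Note that a session-level advisory lock is only held while the database connection stays open, so the ingest process would need to keep that connection alive for the duration of the run.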