
Indexed COL contains redundant ES records

Open mdoering opened this issue 5 months ago • 4 comments

There are many more ES records for COL XR in GBIF than there should be. The 2025.10 XR dwca contains 9.4 million Taxon.txt records, but the ES index reports 14 million.

This has happened before. Back then I removed the dataset from ES and indexed it again. Something must be wrong in the code or setup. Maybe this affects other datasets too?

mdoering avatar Nov 03 '25 15:11 mdoering

indexing the same XR into test: https://www.gbif-test.org/dataset/f29518a9-1fce-43b3-a036-eeaba739baef

mdoering avatar Nov 03 '25 15:11 mdoering

The test copy is fine, with 9.x million usages.

mdoering avatar Nov 04 '25 07:11 mdoering

Also weird are the 5 occurrence records from Plazi that are bound to COL. The dwca does not have any types, and the 5 records, which use Plazi taxon ids, are clearly from this dataset: https://www.gbif.org/dataset/3f6c02fc-66e9-4ad6-a759-ef23da595c48

mdoering avatar Nov 04 '25 07:11 mdoering

Last time this happened it was because the HDFS directory contained data from the previous indexing run, and this was combined with the new index when imported to ES.

MattBlissett avatar Nov 05 '25 14:11 MattBlissett
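
To illustrate the failure mode described in the last comment, here is a minimal sketch in Python. It uses a local temporary directory as a stand-in for the HDFS directory, and all names (`write_run`, `load_all`, `demo`, the `part-*.txt` layout, the `col-xr` folder) are hypothetical; it is not the actual indexing code. It only shows how an import step that reads everything in the output directory ends up with old and new records combined unless the directory is cleared before each run.

```python
import shutil
import tempfile
from pathlib import Path


def write_run(out_dir: Path, run_id: str, records: list[str], clean_first: bool) -> None:
    """Write one indexing run's records as a part file in out_dir (hypothetical layout)."""
    if clean_first and out_dir.exists():
        shutil.rmtree(out_dir)  # drop leftovers from earlier runs
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"part-{run_id}.txt").write_text("\n".join(records) + "\n")


def load_all(out_dir: Path) -> list[str]:
    """Stand-in for the import step: reads every part file found in the directory."""
    rows: list[str] = []
    for part in sorted(out_dir.glob("part-*.txt")):
        rows.extend(part.read_text().splitlines())
    return rows


def demo(clean_first: bool) -> int:
    """Run two consecutive indexing runs and return how many records get imported."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp) / "col-xr"
        write_run(out_dir, "old", ["t1", "t2", "t3"], clean_first)
        write_run(out_dir, "new", ["t1", "t2", "t3", "t4"], clean_first)
        return len(load_all(out_dir))


if __name__ == "__main__":
    print("without cleanup:", demo(clean_first=False))
    print("with cleanup:   ", demo(clean_first=True))
```

Without cleanup the import sees the records of both runs; with cleanup it sees only the latest run. If the same mechanism is at work here, it would be consistent with the index holding more records than the 9.4 million in Taxon.txt, but that remains an assumption until the directory contents are checked.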