Indexed COL contains redundant ES records
There are many more ES records for COL XR in GBIF than there should be. The 2025.10 XR DwC-A contains 9.4 million Taxon.txt records, but the ES index reports 14 million.
This has happened before. That time I removed the dataset from ES and indexed it again. Something must be wrong in the code or the setup. Maybe this affects other datasets too?
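To make the comparison repeatable, here is a minimal sketch that counts the data rows in an extracted Taxon.txt and reports how many indexed records the archive cannot account for. It assumes a tab-delimited file with one record per line and a single header row. The function names are hypothetical, and the ES count is passed in as a plain number rather than queried.

```python
def count_dwca_records(path, has_header=True):
    """Count data rows in a DwC-A core file, assuming one record per line."""
    with open(path, encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not inflate the count.
        rows = sum(1 for line in fh if line.strip())
    if has_header and rows > 0:
        rows -= 1
    return rows


def surplus(indexed_count, archive_count):
    """Indexed records that the archive cannot account for (never negative)."""
    return max(indexed_count - archive_count, 0)
```

With the rounded figures above, `surplus(14_000_000, 9_400_000)` gives roughly 4.6 million unexplained records.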
Indexing the same XR into test: https://www.gbif-test.org/dataset/f29518a9-1fce-43b3-a036-eeaba739baef
The test copy is fine, with 9.x million usages.
Also odd are the 5 occurrence records from Plazi that are bound to COL. The DwC-A does not contain any types, and the 5 records, which use Plazi taxon IDs, clearly come from this dataset: https://www.gbif.org/dataset/3f6c02fc-66e9-4ad6-a759-ef23da595c48
Last time this happened it was because the HDFS directory contained data from the previous indexing run, and this was combined with the new index when imported to ES.
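One way to guard against that failure mode is to refuse to start an indexing run while the output directory still holds files. The sketch below illustrates the idea against a local directory using `pathlib`. It is a stand-in only: the real pipeline writes to HDFS and would need an HDFS client, and the function names are hypothetical.

```python
from pathlib import Path


def leftover_files(output_dir):
    """Return files already present under output_dir, sorted by path."""
    root = Path(output_dir)
    if not root.exists():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def assert_clean(output_dir):
    """Raise if output_dir still contains files from an earlier run."""
    stale = leftover_files(output_dir)
    if stale:
        raise RuntimeError(
            f"{len(stale)} leftover file(s) in {output_dir}, e.g. {stale[0]}"
        )
```

Calling `assert_clean` before each run would turn silently merged output into an explicit error.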