Deduplication and Matching of Cases
We are increasingly ingesting line-list data from multiple sources, and we need ways to find and flag duplicate cases.
These are some scenarios where case duplication is likely to occur:
- Ingesting VoC data separately from the main country-level ingestion. For example, we upload a list of VoC cases in Canada identified in news articles and region-specific datasets, but we are also parsing the country's COVID-19 line list each day.
- Identifying VoC cases manually and from GISAID. We manually enter VoC cases from news articles and any other regional or national datasets we can find. Each of these sources has its own way of failing to overlap with GISAID data (which has a reporting lag of ~1 month).
- Inconsistencies in datasets. If a dataset we're parsing contains duplicates for whatever reason, we want to detect them. The majority of parser sources don't have UUIDs, so we'll have to resort to content-based matching.
These are some scenarios where we need case matching:
- Ingesting genetic and epidemiological datasets from the same region. Ideally we want to inform epi data with genetic inferences and vice versa, so we can ask questions like how age/sex distribution or mortality changes as VoC frequency within the population increases.
- Combining multiple imperfect data sources. Some countries have multiple source datasets which provide different slices of useful information, e.g. one source provides each case's age, sex and location, whilst another provides disease characteristics and dates. If we can match, we can fill in more fields for each case.
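The second scenario above amounts to a record merge: once two records are believed to describe the same case, fields present in only one source fill the gaps in the other. A minimal sketch in Python, with hypothetical field names (not the portal's actual schema):

```python
def merge_cases(record_a, record_b):
    """Combine two records believed to describe the same case.

    Non-empty values from either source fill the merged record; fields
    where both sources have conflicting non-empty values are flagged
    for manual review rather than silently overwritten.
    """
    merged, conflicts = {}, []
    for field in set(record_a) | set(record_b):
        a, b = record_a.get(field), record_b.get(field)
        if a and b and a != b:
            conflicts.append(field)
        merged[field] = a or b  # prefer the first source's value
    return merged, conflicts

# Hypothetical example: one source has demographics, the other has
# disease characteristics and dates.
case_a = {"age": 34, "sex": "female", "location": "Ontario"}
case_b = {"location": "Ontario", "variant": "B.1.1.7",
          "date_confirmed": "2021-02-01"}
merged, conflicts = merge_cases(case_a, case_b)
```

Preferring the first source on conflict is an arbitrary choice for the sketch; in practice we'd want a per-field source-priority policy.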
Solutions
There are a number of ways to tackle these issues. Some are quick fixes; others are longer-term proposals that would need to be validated. Here are three:
- Content-based matching with simple logic. If more than x fields between two cases are an exact match, flag the pair. We've discussed building something on the curator portal that surfaces these flagged potential duplicates for manual review, and someone can go through and click yes or no.
- Network approach. Find duplicates based on some measure of inter- vs intra-group node clustering.
- Just accept some level of case duplication. Some overlap of cases between disparate sources is inevitable. Maybe we should accept this and make it explicit for end users.
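The first option is simple enough to sketch concretely. Assuming cases are dicts and using hypothetical field names, flag any pair agreeing on at least x non-empty fields:

```python
def count_matching_fields(case_a, case_b, fields):
    """Count fields that are an exact, non-empty match between two cases."""
    return sum(
        1 for f in fields
        if case_a.get(f) not in (None, "") and case_a.get(f) == case_b.get(f)
    )

def flag_potential_duplicates(cases, fields, threshold=3):
    """Return index pairs of cases that agree on >= `threshold` fields.

    O(n^2) pairwise comparison -- fine for a sketch, but a real
    implementation would first block on a cheap key (e.g. location)
    to avoid comparing every case against every other.
    """
    flagged = []
    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            if count_matching_fields(cases[i], cases[j], fields) >= threshold:
                flagged.append((i, j))
    return flagged

cases = [
    {"age": 30, "sex": "male", "location": "Toronto", "date": "2021-01-01"},
    {"age": 30, "sex": "male", "location": "Toronto", "date": "2021-01-02"},
    {"age": 55, "sex": "female", "location": "Vancouver", "date": "2021-01-01"},
]
pairs = flag_potential_duplicates(cases, ["age", "sex", "location", "date"])
```

The flagged pairs would then feed the curator-portal review queue rather than being merged automatically.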
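One concrete reading of the network approach (an assumption, not a settled design): build a graph where an edge links any two cases sharing enough field values, then treat connected components as candidate duplicate groups. A union-find sketch in Python:

```python
from collections import defaultdict

def cluster_cases(cases, fields, min_shared=2):
    """Group cases into candidate duplicate clusters.

    An (implicit) edge links cases i and j when they share at least
    `min_shared` exact non-empty field values; union-find then yields
    the connected components. Real inter- vs intra-group clustering
    measures would refine these coarse components.
    """
    parent = list(range(len(cases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            shared = sum(
                1 for f in fields
                if cases[i].get(f) and cases[i].get(f) == cases[j].get(f)
            )
            if shared >= min_shared:
                parent[find(i)] = find(j)  # union the two components

    clusters = defaultdict(list)
    for i in range(len(cases)):
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())
```

Clusters of size one are unique cases; larger clusters go to review.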