script/delete_records: Add option to match fields with regex pattern

Open joverlee521 opened this issue 2 years ago • 2 comments

Uses RethinkDB's match command to filter for records whose field value matches the provided regex pattern. See the RethinkDB docs for more details: https://rethinkdb.com/api/python/match/
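
To illustrate the matching semantics, here is a minimal sketch in plain Python that mimics the filter locally with re.search. It is not the script's actual code: the sample records and the matches_all helper are hypothetical, and RethinkDB evaluates patterns server-side with RE2 syntax, which can differ from Python's re module for advanced constructs.

```python
import re

def matches_all(record, patterns):
    """Return True if every field in `patterns` holds a string matching its regex.

    Hypothetical helper: a local stand-in for a server-side regex filter.
    """
    for field, pattern in patterns.items():
        value = record.get(field)
        # Non-string or missing values never match.
        if not isinstance(value, str) or re.search(pattern, value) is None:
            return False
    return True

# Made-up sample records, for illustration only.
records = [
    {"accession": "EPIEPI123456", "source": "gisaid"},
    {"accession": "EPI123456", "source": "gisaid"},
    {"accession": "XEPIEPI1", "source": "gisaid"},
]

selected = [r for r in records if matches_all(r, {"accession": "^EPIEPI"})]
print([r["accession"] for r in selected])  # ['EPIEPI123456']
```

Because the pattern is anchored with ^, only accessions that begin with the doubled prefix are selected; an unanchored "EPIEPI" would also select the third sample record.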

This was prompted by our need to delete flu sequence records whose accessions match the pattern "EPIEPI". We've fixed the accessions with https://github.com/nextstrain/fauna/pull/148, but we need to manually remove the old duplicate sequence records because the flu sequence table uses the accession as the index.¹

¹ https://github.com/nextstrain/fauna/blob/ec1feb679715890ae6d14efe11c979f27d6f1d6f/vdb/upload.py#L82

Checklist

  • [ ] Checks pass

joverlee521 • Nov 22 '23 20:11

Testing locally with the --preview flag:

$ envdir ../env.d/seasonal-flu/ python scripts/delete_records.py -db vdb -v flu_sequences --match "accession:^EPIEPI" --preview
Connected to the "vdb" database
Delete filters: {}
Delete matches: {'accession': '^EPIEPI'}
Delete intervals: {}
Preview: selection would delete 15933 records
Sources of deleted records: {'gisaid'}
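
The output above shows the --match value "accession:^EPIEPI" being reported as {'accession': '^EPIEPI'}. One way such a "field:regex" value could be turned into that dictionary is sketched below. This is an assumption about the format, not the script's real parser; parse_match_args is a hypothetical name.

```python
def parse_match_args(specs):
    """Convert ["field:regex", ...] into {field: regex}.

    Hypothetical parser. It splits on the first colon only, so the
    regex itself may contain colons.
    """
    matches = {}
    for spec in specs:
        field, sep, pattern = spec.partition(":")
        if not sep or not field or not pattern:
            raise ValueError(f"expected 'field:regex', got {spec!r}")
        matches[field] = pattern
    return matches

print(parse_match_args(["accession:^EPIEPI"]))  # {'accession': '^EPIEPI'}
```

Splitting on only the first colon is a deliberate choice in this sketch, since field names are unlikely to contain a colon while a regex might.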

joverlee521 • Nov 22 '23 20:11

One potential issue with this is that the sequence accessions are added to the virus records during upload: https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L477-L491

So even if we delete the sequence records with "bad" accessions, those accessions are still listed in the virus records' "sequences" field. The --overwrite option for flu_upload only appends new sequences with set_union; it does not remove existing entries.

https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L606-L612
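
A small sketch of why the stale accession survives a re-upload. The set_union function below is a plain-Python stand-in for the set_union behavior described above (combine and deduplicate, never remove); it is not the upload code, element ordering in RethinkDB may differ, and the strain name and accessions are made up.

```python
def set_union(existing, new):
    """Stand-in for a set-style union on lists.

    Keeps every existing value, adds unseen new values, drops duplicates.
    Nothing already present is ever removed.
    """
    result = []
    for item in list(existing) + list(new):
        if item not in result:
            result.append(item)
    return result

# Hypothetical virus record that still lists the bad accession.
virus = {"strain": "A/Example/1/2023", "sequences": ["EPIEPI123456"]}

# Re-uploading with the corrected accession appends it next to the old one.
virus["sequences"] = set_union(virus["sequences"], ["EPI123456"])
print(virus["sequences"])  # ['EPIEPI123456', 'EPI123456']
```

In this sketch the "sequences" list ends up holding both the bad and the corrected accession, which is the messy state described above.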


Functionally, I don't think this is an issue because I cannot find any script that actually uses the "sequences"/"number_sequences" fields from the virus table. It's messy data that annoys me, but I can also ignore it if it's not important to others.

joverlee521 • Nov 22 '23 21:11