script/delete_records: Add option to match fields with regex pattern
Uses RethinkDB's match command to filter for records whose field value
matches the provided regex pattern. See the RethinkDB docs for more
details: https://rethinkdb.com/api/python/match/
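As a rough illustration of the matching behavior described above, here is a minimal sketch in plain Python. It is not the code from this PR: `parse_match_args` and `record_matches` are hypothetical helpers, the sample records are made up, and Python's `re.search` stands in for ReQL's `match` (which uses RE2 syntax, so some patterns may behave differently on the server).

```python
import re


def parse_match_args(match_args):
    """Turn ["field:pattern", ...] into {field: pattern} (hypothetical helper)."""
    matches = {}
    for arg in match_args:
        # Split on the first colon only, so the pattern itself may contain colons.
        field, pattern = arg.split(":", 1)
        matches[field] = pattern
    return matches


def record_matches(record, matches):
    """Local stand-in for ReQL match: every listed field must match its pattern."""
    return all(
        isinstance(record.get(field), str)
        and re.search(pattern, record[field]) is not None
        for field, pattern in matches.items()
    )


# Made-up records for illustration; real accessions will differ.
records = [
    {"accession": "EPIEPI123456", "source": "gisaid"},
    {"accession": "EPI123456", "source": "gisaid"},
]

matches = parse_match_args(["accession:^EPIEPI"])
to_delete = [rec for rec in records if record_matches(rec, matches)]

print(matches)                                 # {'accession': '^EPIEPI'}
print([rec["accession"] for rec in to_delete])  # ['EPIEPI123456']
```

Against a live table, the equivalent filter would presumably be something like `r.table(...).filter(lambda doc: doc["accession"].match("^EPIEPI"))`, but the exact query built by the script may differ.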
This was prompted by our need to delete flu sequence records that have accessions with pattern "EPIEPI". We've fixed the accession with https://github.com/nextstrain/fauna/pull/148, but we need to manually remove the old duplicate sequence records because the flu sequence table uses the accession as the index.¹
¹ https://github.com/nextstrain/fauna/blob/ec1feb679715890ae6d14efe11c979f27d6f1d6f/vdb/upload.py#L82
Checklist
- [ ] Checks pass
Testing locally with the --preview flag:
$ envdir ../env.d/seasonal-flu/ python scripts/delete_records.py -db vdb -v flu_sequences --match "accession:^EPIEPI" --preview
Connected to the "vdb" database
Delete filters: {}
Delete matches: {'accession': '^EPIEPI'}
Delete intervals: {}
Preview: selection would delete 15933 records
Sources of deleted records: {'gisaid'}
One potential issue with this is that the sequence accessions are added to the virus records during upload: https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L477-L491
So even if we delete the "bad" accession sequence records, they are still listed in the virus records' "sequences" field.
The --overwrite option for flu_upload will only append new sequences with set_union, so it does not remove the stale accessions.
https://github.com/nextstrain/fauna/blob/dda8186ed47a254e91ade0e87240d5a53e2f046b/vdb/upload.py#L606-L612
Functionally, I don't think this is an issue because I cannot find any script that actually uses the "sequences"/"number_sequences" fields from the virus table. It's messy data that annoys me, but I can also ignore it if it's not important to others.
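To illustrate the stale-accession concern above, here is a small sketch using plain Python lists. The `set_union` function is a local stand-in for the ReQL command of the same name, and the strain name, accessions, and record shape are made up and simplified (entries in the real "sequences" field may hold more than a bare accession).

```python
def set_union(existing, new):
    """Local stand-in for ReQL set_union: keep existing items, add unseen ones."""
    merged = list(existing)
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged


# Hypothetical virus record that still lists the "bad" accession.
virus = {"strain": "A/Example/1/2020", "sequences": ["EPIEPI123456"]}

# Re-uploading with the corrected accession only appends to the list.
virus["sequences"] = set_union(virus["sequences"], ["EPI123456"])
virus["number_sequences"] = len(virus["sequences"])

print(virus["sequences"])         # ['EPIEPI123456', 'EPI123456']
print(virus["number_sequences"])  # 2
```

Under these assumptions the old accession stays in the virus record even after the matching sequence record is deleted, so cleaning it up would need a separate update to the virus table.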