ncov-ingest
Pull data directly from COG-UK Data
Context
There has been a significant drop-off in UK sequences in the NCBI data since ~April 2022 (the issue was originally raised in Slack).
Description
We can update the pipeline to pull metadata and sequences directly from COG-UK Data instead of waiting on them to submit to NCBI.
We would have to use the ena_sample.secondary_accession column in their accessions TSV to drop duplicates from GenBank via the BioSample accession.
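A rough sketch of that de-duplication step, in case it helps the discussion. Only the ena_sample.secondary_accession column name comes from the description above; the GenBank field name (`biosample_accession`), the `central_sample_id` column, and the accessions in the example are placeholders, not the actual ingest schema.

```python
import csv
import io


def cog_uk_biosamples(accessions_tsv, column="ena_sample.secondary_accession"):
    """Collect the non-empty BioSample accessions listed in the COG-UK accessions TSV."""
    reader = csv.DictReader(accessions_tsv, delimiter="\t")
    return {
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


def drop_cog_uk_duplicates(genbank_records, biosamples,
                           biosample_field="biosample_accession"):
    """Yield only the GenBank records whose BioSample is not in the COG-UK set.

    `biosample_field` is an assumed field name, used here for illustration.
    """
    for record in genbank_records:
        if record.get(biosample_field) not in biosamples:
            yield record


# Tiny in-memory example with made-up accessions.
accessions = io.StringIO(
    "central_sample_id\tena_sample.secondary_accession\n"
    "SAMPLE-1\tSAMEA0000001\n"
    "SAMPLE-2\t\n"
)
genbank = [
    {"genbank_accession": "XX000001", "biosample_accession": "SAMEA0000001"},
    {"genbank_accession": "XX000002", "biosample_accession": "SAMN0000002"},
]
kept = list(drop_cog_uk_duplicates(genbank, cog_uk_biosamples(accessions)))
```

Rows with an empty secondary accession are skipped when building the set, so GenBank records without a BioSample are never dropped by accident.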
We discussed a couple of options to address this during triage:
- Reach out to the COG-UK group via Slack to see if there are plans to continue submitting to NCBI more regularly
- Add COG-UK to ingest, which will require a way to convert the metadata and sequences to NDJSON prior to applying transforms.
@joverlee521 will continue work on the scripts for the latter option and then revisit this issue.
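For reference, the metadata + sequences to NDJSON conversion could look roughly like the sketch below. This is not the script being worked on; the function names are hypothetical, and the `sequence_name` join column and the other example columns are assumptions about the COG-UK metadata CSV.

```python
import csv
import io
import json


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_ndjson(metadata_csv, fasta, id_column="sequence_name"):
    """Yield one JSON line per metadata row that has a matching sequence.

    `id_column` is an assumed join column, used here for illustration.
    """
    sequences = dict(read_fasta(fasta))
    for row in csv.DictReader(metadata_csv):
        sequence = sequences.get(row[id_column])
        if sequence is None:
            continue  # metadata row without a sequence: skip it
        yield json.dumps({**row, "sequence": sequence})


# Made-up records, only to show the shape of the output.
metadata = io.StringIO(
    "sequence_name,sample_date\n"
    "England/EXAMPLE-1/2022,2022-04-01\n"
    "England/EXAMPLE-2/2022,2022-04-02\n"
)
fasta = io.StringIO(">England/EXAMPLE-1/2022\nACG\nTAC\n")
ndjson_lines = list(to_ndjson(metadata, fasta))
```

The sketch holds every sequence in memory to keep the join simple; a full-size dataset would probably need a streaming or indexed join instead.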
Prompted by @corneliusroemer, this is my general idea of how to switch to directly pulling data from COG-UK instead of relying on their submissions to GenBank:
- Update the current patch of COG-UK data to remove all COG-UK records from the GenBank data. It will be less confusing if we make sure all COG-UK data comes from a single source instead of a mix of sources. I also think this is the best way to ensure that we do not have duplicate COG-UK records.
- Add a rule to fetch the COG-UK sequences. I think this should be the All sequence FASTA since we do our own alignment and masking. (We already fetch the COG-UK metadata CSV.)
- The COG-UK metadata CSV is formatted differently than GenBank data, so I think we can run it through its own transform pipeline with some combination of tsv-utils, csvtk, and/or the upcoming augur curate command. The produced TSV + FASTA can then be appended to the GenBank files before upload to S3.
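To make the final append step concrete, here is a minimal sketch of merging the transformed COG-UK TSV into the GenBank TSV. It uses plain Python instead of tsv-utils or csvtk. The function name, the column names in the example, and the choice to fill missing columns with an empty string are all assumptions.

```python
import csv
import io


def append_tsv(genbank_tsv, cog_uk_tsv, out, missing=""):
    """Write the GenBank rows followed by the COG-UK rows, using the GenBank header.

    COG-UK columns are reordered to match the GenBank header, extra columns
    are dropped, and absent columns are filled with `missing`.
    """
    genbank = csv.DictReader(genbank_tsv, delimiter="\t")
    writer = csv.DictWriter(
        out,
        fieldnames=genbank.fieldnames,
        delimiter="\t",
        restval=missing,
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in genbank:
        writer.writerow(row)
    for row in csv.DictReader(cog_uk_tsv, delimiter="\t"):
        writer.writerow(row)


# Made-up rows; the real files have many more columns.
genbank_tsv = io.StringIO("strain\tdate\tregion\nA\t2022-01-01\tEurope\n")
cog_uk_tsv = io.StringIO("date\tstrain\tlineage\n2022-04-01\tB\tBA.2\n")
combined = io.StringIO()
append_tsv(genbank_tsv, cog_uk_tsv, combined)
combined_text = combined.getvalue()
```

Keying the output on the GenBank header keeps the downstream steps unchanged, at the cost of silently dropping any COG-UK-only columns. The FASTA side needs no alignment of columns and could be a plain concatenation once the COG-UK records have been removed from the GenBank data.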