ncov-ingest Pull data directly from COG-UK Data

Context

There has been a significant drop-off in sequences from the UK in the NCBI data since ~April 2022 (the issue was originally raised in Slack).

Description

We can update the pipeline to pull metadata and sequences directly from COG-UK Data instead of waiting on them to submit to NCBI.

We would have to use the ena_sample.secondary_accession column in their accessions TSV to drop duplicates from GenBank via the BioSample accession.
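The de-duplication described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `ena_sample.secondary_accession` column name comes from this issue, but the GenBank-side field name `biosample_accession` and both function names are hypothetical and would need to match whatever the pipeline actually uses.

```python
import csv
import io


def load_cog_uk_biosamples(accessions_tsv, column="ena_sample.secondary_accession"):
    """Collect the non-empty BioSample accessions listed in the COG-UK accessions TSV."""
    reader = csv.DictReader(accessions_tsv, delimiter="\t")
    return {
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


def drop_cog_uk_duplicates(genbank_records, cog_uk_biosamples,
                           biosample_field="biosample_accession"):
    """Yield only the GenBank records whose BioSample is not a known COG-UK sample.

    `biosample_field` is an assumed name for the GenBank-side column.
    """
    for record in genbank_records:
        if record.get(biosample_field) in cog_uk_biosamples:
            continue
        yield record


# Tiny made-up example: one GenBank record shares a BioSample with COG-UK.
example_accessions = io.StringIO(
    "central_sample_id\tena_sample.secondary_accession\n"
    "EXAMPLE-1\tSAMEA0000001\n"
    "EXAMPLE-2\t\n"
)
example_genbank = [
    {"strain": "England/EXAMPLE-1/2022", "biosample_accession": "SAMEA0000001"},
    {"strain": "USA/EXAMPLE-9/2022", "biosample_accession": "SAMN0000009"},
]
cog_uk_biosamples = load_cog_uk_biosamples(example_accessions)
kept = list(drop_cog_uk_duplicates(example_genbank, cog_uk_biosamples))
```

Records with an empty BioSample accession on the COG-UK side are ignored, so they cannot accidentally match GenBank records that also lack one.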

Jul 25 '22 19:07 joverlee521

We discussed a couple of options to address this during triage:

1. Reach out to the COG-UK group via Slack to see if there are plans to continue submitting to NCBI more regularly.
2. Add COG-UK to ingest, which will require a way to convert the metadata and sequences to NDJSON before applying transforms.

@joverlee521 will continue work on the latter scripts and then revisit this issue.
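For the option of adding COG-UK to ingest, joining a metadata CSV with a FASTA file into NDJSON might look something like the sketch below. It is not the actual ncov-ingest script: the join column `sequence_name`, the output field `sequence`, and the function names are assumptions made for illustration.

```python
import csv
import io
import json


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cog_uk_to_ndjson(metadata_csv, fasta, out, key="sequence_name"):
    """Write one JSON object per sequence that has a matching metadata row.

    The metadata is held in memory and the FASTA is streamed, since the
    sequences are expected to be the larger of the two inputs.
    Returns the number of records written.
    """
    metadata = {row[key]: row for row in csv.DictReader(metadata_csv)}
    written = 0
    for name, sequence in read_fasta(fasta):
        row = metadata.get(name)
        if row is None:
            continue  # no metadata for this sequence, skip it
        out.write(json.dumps({**row, "sequence": sequence}) + "\n")
        written += 1
    return written


# Tiny made-up example with one matched and one unmatched sequence.
example_metadata = io.StringIO(
    "sequence_name,sample_date\nEngland/EXAMPLE-1/2022,2022-05-01\n"
)
example_fasta = io.StringIO(
    ">England/EXAMPLE-1/2022\nACGT\nTTGA\n>England/UNMATCHED/2022\nNNNN\n"
)
example_out = io.StringIO()
count = cog_uk_to_ndjson(example_metadata, example_fasta, example_out)
```

Sequences without a metadata row are silently skipped here; a real pipeline would probably want to log or count them instead.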

Jul 26 '22 19:07 huddlej

Prompted by @corneliusroemer, this is my general idea of how to switch to directly pulling data from COG-UK instead of relying on their submissions to GenBank:

1. Update the current patch of COG-UK data to remove all COG-UK records from the GenBank data. It will be less confusing if we make sure all COG-UK data comes from a single source instead of a mix of sources. I also think this is the best way to ensure that we do not have duplicate COG-UK records.
2. Add a rule to fetch the COG-UK sequences. I think this should be the All sequence FASTA since we do our own alignment and masking. (We already fetch the COG-UK metadata CSV.)
3. The COG-UK metadata CSV is formatted differently from the GenBank data, so I think we can run it through its own transform pipeline with some combination of tsv-utils, csvtk, and/or the upcoming augur curate command. The resulting TSV + FASTA can then be appended to the GenBank files before upload to S3.
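The transform-and-append step described above could be prototyped in plain Python before settling on tsv-utils, csvtk, or augur curate. The sketch below is only an illustration: the column mapping, the `database` column, and the function names are all hypothetical, and the real COG-UK and GenBank column names would need to be checked against the actual files.

```python
import csv
import io

# Hypothetical mapping from COG-UK CSV columns to GenBank-style TSV columns.
COLUMN_MAP = {
    "sequence_name": "strain",
    "sample_date": "date",
    "lineage": "pango_lineage",
}


def transform_cog_uk_rows(cog_uk_csv, column_map=COLUMN_MAP, source="COG-UK"):
    """Rename COG-UK columns and tag each record with its source."""
    for row in csv.DictReader(cog_uk_csv):
        record = {new: row.get(old, "") for old, new in column_map.items()}
        record["database"] = source  # assumed name for the source column
        yield record


def append_to_tsv(genbank_tsv, cog_uk_records, out):
    """Write the GenBank rows followed by the COG-UK rows, using the GenBank header.

    COG-UK fields that the GenBank header lacks are dropped, and GenBank
    columns missing from a COG-UK record are left empty.
    """
    reader = csv.DictReader(genbank_tsv, delimiter="\t")
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    for record in cog_uk_records:
        writer.writerow(record)


# Tiny made-up example.
example_genbank = io.StringIO(
    "strain\tdate\tdatabase\nUSA/EXAMPLE-9/2022\t2022-01-01\tGenBank\n"
)
example_cog_uk = io.StringIO(
    "sequence_name,sample_date,lineage\nEngland/EXAMPLE-1/2022,2022-05-01,BA.2\n"
)
combined = io.StringIO()
append_to_tsv(example_genbank, transform_cog_uk_rows(example_cog_uk), combined)
```

Using the GenBank header as the output schema keeps the combined file compatible with whatever already reads the GenBank TSV, at the cost of dropping any COG-UK-only columns.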

Oct 03 '22 22:10 joverlee521
