ncov-ingest

Pull data directly from COG-UK Data

Open joverlee521 opened this issue 3 years ago • 2 comments

Context

There has been a significant drop-off in UK sequences in the NCBI data since ~April 2022 (the issue was originally raised in Slack):

[Figure: genbank-uk]

Description

We can update the pipeline to pull metadata and sequences directly from COG-UK Data instead of waiting on them to submit to NCBI.

We would have to use the ena_sample.secondary_accession column in their accessions TSV to drop duplicates from GenBank via the BioSample accession.
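A minimal sketch of that de-duplication, assuming the GenBank records carry their BioSample accession in a field named `biosample_accession` (a hypothetical name; only `ena_sample.secondary_accession` comes from the description above). The `central_sample_id` column and all accession values are made up for illustration.

```python
import csv
import io


def load_cog_uk_biosamples(accessions_tsv, column="ena_sample.secondary_accession"):
    """Collect the non-empty BioSample accessions listed in the COG-UK accessions TSV."""
    reader = csv.DictReader(accessions_tsv, delimiter="\t")
    return {
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


def drop_cog_uk_duplicates(genbank_records, cog_uk_biosamples, field="biosample_accession"):
    """Yield only GenBank records whose BioSample accession is not in the COG-UK set.

    `field` is an assumed name for the BioSample column on the GenBank side.
    """
    for record in genbank_records:
        if record.get(field, "") not in cog_uk_biosamples:
            yield record


# Tiny made-up inputs; real data would come from the downloaded files.
accessions = io.StringIO(
    "central_sample_id\tena_sample.secondary_accession\n"
    "SAMPLE-A\tSAMEA0000001\n"
    "SAMPLE-B\t\n"
)
genbank = [
    {"genbank_accession": "XX000001", "biosample_accession": "SAMEA0000001"},
    {"genbank_accession": "XX000002", "biosample_accession": "SAMN0000002"},
]
kept = list(drop_cog_uk_duplicates(genbank, load_cog_uk_biosamples(accessions)))
```

Rows with an empty accession are skipped when building the set, so GenBank records that lack a BioSample accession are never dropped by accident.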

joverlee521 avatar Jul 25 '22 19:07 joverlee521

We discussed a couple of options to address this during triage:

  1. Reach out to the COG-UK group via Slack to see whether they plan to submit to NCBI more regularly.
  2. Add COG-UK to ingest, which will require a way to convert metadata and sequences to NDJSON before applying transforms.

@joverlee521 will continue work on the scripts for the latter option and then revisit this issue.
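For illustration, the metadata-plus-sequences to NDJSON step in option 2 could look roughly like the sketch below. It is not the scripts referred to above. The column name `sequence_name` used to join metadata rows to FASTA records, the `sample_date` column, and the sample values are all assumptions.

```python
import csv
import io
import json


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_ndjson(metadata_csv, fasta, id_column="sequence_name"):
    """Yield one JSON line per metadata row that has a matching sequence.

    `id_column` is an assumed join key between the metadata and the FASTA.
    """
    # Holds every sequence in memory; a full-size dataset would likely
    # need an indexed or streaming lookup instead.
    sequences = dict(read_fasta(fasta))
    for row in csv.DictReader(metadata_csv):
        sequence = sequences.get(row[id_column])
        if sequence is not None:
            yield json.dumps({**row, "sequence": sequence})


# Made-up example inputs.
metadata = io.StringIO(
    "sequence_name,sample_date\n"
    "England/X-1/2022,2022-04-01\n"
    "England/X-2/2022,2022-04-02\n"
)
fasta = io.StringIO(">England/X-1/2022\nACGT\nTTGA\n")
lines = list(to_ndjson(metadata, fasta))
```

Metadata rows without a matching sequence are dropped here; whether they should instead be kept with an empty sequence is a design choice left open.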

huddlej avatar Jul 26 '22 19:07 huddlej

Prompted by @corneliusroemer, this is my general idea of how to switch to directly pulling data from COG-UK instead of relying on their submissions to GenBank:

  1. Update the current patch of COG-UK data to remove all COG-UK records from the GenBank data. It will be less confusing if we make sure all COG-UK data comes from a single source instead of a mix of sources. I also think this is the best way to ensure that we do not have duplicate COG-UK records.

  2. Add a rule to fetch the COG-UK sequences. I think this should be the All sequence FASTA, since we do our own alignment and masking. (We already fetch the COG-UK metadata CSV.)

  3. The COG-UK metadata CSV is formatted differently from the GenBank data, so I think we can run it through its own transform pipeline with some combination of tsv-utils, csvtk, and/or the upcoming augur curate command. The resulting TSV and FASTA can then be appended to the GenBank files before upload to S3.
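As a rough illustration of step 3, the sketch below renames COG-UK CSV columns to match the GenBank TSV header and appends the rows. It uses plain Python rather than tsv-utils, csvtk, or augur curate, and the column mapping and sample values are hypothetical.

```python
import csv
import io

# Hypothetical mapping from COG-UK CSV columns to the pipeline's TSV columns.
COLUMN_MAP = {"sequence_name": "strain", "sample_date": "date", "country": "country"}


def append_cog_uk(cog_uk_csv, genbank_tsv, out_tsv, column_map=COLUMN_MAP):
    """Write GenBank rows unchanged, then COG-UK rows renamed to the same columns."""
    genbank = csv.DictReader(genbank_tsv, delimiter="\t")
    writer = csv.DictWriter(
        out_tsv,
        fieldnames=genbank.fieldnames,  # keep the GenBank header as-is
        delimiter="\t",
        restval="",                     # blank for columns COG-UK lacks
        extrasaction="ignore",          # skip mapped columns GenBank lacks
        lineterminator="\n",
    )
    writer.writeheader()
    for row in genbank:
        writer.writerow(row)
    for row in csv.DictReader(cog_uk_csv):
        writer.writerow({new: row.get(old, "") for old, new in column_map.items()})


# Made-up example inputs.
genbank_tsv = io.StringIO(
    "strain\tdate\tcountry\tgenbank_accession\n"
    "USA/1/2022\t2022-03-01\tUSA\tXX000001\n"
)
cog_uk_csv = io.StringIO(
    "sequence_name,country,sample_date\n"
    "England/X-1/2022,UK,2022-04-01\n"
)
combined = io.StringIO()
append_cog_uk(cog_uk_csv, genbank_tsv, combined)
```

Because step 1 removes COG-UK records from the GenBank data first, a plain append like this should not reintroduce duplicates.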

joverlee521 avatar Oct 03 '22 22:10 joverlee521