Genbank parser dev

Open Bdegraaf1234 opened this issue 5 years ago • 10 comments

The genbankParser is designed to be run as a standalone script that generates a formatted and cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful) and attempts to infer missing hostTaxonIds for duplicate and highly homologous entries: if all entries in a cluster or duplicate set that provide a hostTaxonId provide the same one, that value is inferred (in a new column) for the entries where none was provided. This infers about 1500 hostTaxonIds.
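The inference rule described above can be illustrated with a minimal R sketch. This is not the parser's actual code: the column names (`cluster`, `hostTaxonId`, `inferredHostTaxonId`), the function name `infer_host`, and the toy records are all assumptions made for illustration.

```r
# Toy records; accessions, cluster labels and column names are hypothetical.
records <- data.frame(
  accession   = c("A1", "A2", "A3", "B1", "B2", "B3"),
  cluster     = c(1, 1, 1, 2, 2, 2),
  hostTaxonId = c(9606, NA, 9606, 9606, 9823, NA)
)

# For each cluster, if all known hostTaxonIds agree, copy that value into a
# new column for the entries that had none. Conflicting clusters are left as-is.
infer_host <- function(df) {
  df$inferredHostTaxonId <- df$hostTaxonId
  for (cl in unique(df$cluster)) {
    in_cluster <- df$cluster == cl
    ids <- df$hostTaxonId[in_cluster]
    known <- unique(ids[!is.na(ids)])
    if (length(known) == 1) {
      fill <- in_cluster & is.na(df$hostTaxonId)
      df$inferredHostTaxonId[fill] <- known
    }
  }
  df
}

result <- infer_host(records)
```

In this toy data, the missing value in cluster 1 is filled because its known entries agree, while the missing value in cluster 2 stays `NA` because its known entries conflict. The original `hostTaxonId` column is left untouched.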

Bdegraaf1234 avatar Apr 27 '20 11:04 Bdegraaf1234

Didn't test, but functionality looks great, sorry again for treading on this earlier!

rcedgar avatar Apr 27 '20 14:04 rcedgar

Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) TSV format, e.g.

 else if (grepl(fixed = FALSE, "Vespadelus baverstocki", NoParenth, ignore.case = T)) {
    return("unclassified Scotoecus") 
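A rough sketch of what the table-driven version might look like in R. The column names (`pattern`, `taxon`) and the function name `lookup_host` are assumptions, and the table is read from an inline string here only to keep the example self-contained; in practice it would live in an external TSV file.

```r
# Hypothetical mapping table with a single row taken from the branch above.
# In practice this text would be an external .tsv file passed to read.delim().
tsv_text <- "pattern\ttaxon\nVespadelus baverstocki\tunclassified Scotoecus\n"
host_map <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# Return the taxon of the first pattern that matches the host name
# (regex match, case-insensitive, like the grepl call above), or NA if none does.
lookup_host <- function(host_name, map = host_map) {
  for (i in seq_len(nrow(map))) {
    if (grepl(map$pattern[i], host_name, ignore.case = TRUE)) {
      return(map$taxon[i])
    }
  }
  NA_character_
}
```

Adding a new mapping then means adding a row to the table rather than another `else if` branch. Row order matters in this sketch, since the first matching pattern wins, which mirrors how an if/else chain behaves.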

rcedgar avatar Apr 27 '20 15:04 rcedgar

I fully agree, I'll try and implement it somewhere in the next few days.

Bdegraaf1234 avatar Apr 27 '20 16:04 Bdegraaf1234

What is the status of this pull request? Are we waiting on a review from @ababaian ?

taltman avatar May 17 '20 07:05 taltman

It's good; it's definitely working, but there are some features that need to be added. See the taxonomy issue.

ababaian avatar May 17 '20 16:05 ababaian

Two questions:

  • Is @Bdegraaf1234 still maintaining this code?
  • Where is this code in the repo?

taltman avatar May 17 '20 22:05 taltman

Never mind the "where" part of the question; I looked at the commits.

taltman avatar May 17 '20 22:05 taltman

Probably not. We can merge and close this and have someone pick up from here.

ababaian avatar May 17 '20 22:05 ababaian

I can take a crack at the metadata file, but it sounds like @rcedgar might be doing some of this with #101, so we should coordinate.

taltman avatar May 17 '20 22:05 taltman

Parsing genbank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.

rcedgar avatar May 17 '20 23:05 rcedgar