serratus Genbank parser dev

The genbankParser is designed to run as a standalone script that generates a formatted, cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful) and attempts to infer missing hostTaxonIds for duplicate and highly homologous entries: if all entries in a cluster or duplicate set that provide a hostTaxonId agree on it, that value is inferred (in a new column) for the entries where none was provided. This infers about 1500 hostTaxonIds.
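The inference rule described above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the data frame, the column names (`accession`, `cluster`, `inferredHostTaxonId`) and the function `infer_host_taxon` are all assumptions made for the example.

```r
# Hypothetical input: one row per GenBank entry, with a cluster id and a
# possibly missing hostTaxonId. All names and values here are illustrative.
entries <- data.frame(
  accession   = c("A1", "A2", "A3", "B1", "B2", "C1"),
  cluster     = c("c1", "c1", "c1", "c2", "c2", "c3"),
  hostTaxonId = c(9606, NA, 9606, 9606, 9397, NA),
  stringsAsFactors = FALSE
)

infer_host_taxon <- function(df) {
  # Per cluster: keep the host id only if all known ids agree, else NA.
  consensus <- tapply(df$hostTaxonId, df$cluster, function(ids) {
    known <- unique(ids[!is.na(ids)])
    if (length(known) == 1) known else NA
  })
  # Write the result to a new column so the original values stay untouched.
  df$inferredHostTaxonId <- df$hostTaxonId
  missing <- is.na(df$hostTaxonId)
  df$inferredHostTaxonId[missing] <- as.vector(consensus[df$cluster[missing]])
  df
}

result <- infer_host_taxon(entries)
```

In this sketch, entry A2 picks up the id shared by the rest of cluster c1, while C1 stays NA because its cluster provides no id at all. Cluster c2 disagrees internally, so nothing would be inferred from it.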

Apr 27 '20 11:04 Bdegraaf1234

Didn't test, but functionality looks great, sorry again for treading on this earlier!

Apr 27 '20 14:04 rcedgar

Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) tsv format, e.g.

 else if (grepl(fixed = FALSE, "Vespadelus baverstocki", NoParenth, ignore.case = T)) {
    return("unclassified Scotoecus")
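
A rough sketch of what the external-table approach could look like. The column names `pattern` and `replacement`, the function `fix_host_name`, and the file name mentioned in the comment are assumptions for illustration; the table is inlined here only so the example is self-contained, and returning the input unchanged on no match is also an assumption about the desired behaviour.

```r
# Hypothetical mapping table. In practice this would live in its own file and
# be loaded with something like read.delim("host_name_fixes.tsv").
lookup_tsv <- "pattern\treplacement
Vespadelus baverstocki\tunclassified Scotoecus
"
host_fixes <- read.delim(text = lookup_tsv, stringsAsFactors = FALSE)

fix_host_name <- function(name, fixes) {
  # Return the replacement for the first pattern that matches, mirroring
  # the first-match-wins behaviour of an else-if chain.
  for (i in seq_len(nrow(fixes))) {
    if (grepl(fixes$pattern[i], name, ignore.case = TRUE)) {
      return(fixes$replacement[i])
    }
  }
  name  # assumption: names with no matching pattern pass through unchanged
}
```

Adding or correcting a mapping then means editing one row of the tsv rather than another branch in the code.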

Apr 27 '20 15:04 rcedgar

I fully agree, I'll try and implement it somewhere in the next few days.

Apr 27 '20 16:04 Bdegraaf1234

What is the status of this pull request? Are we waiting on a review from @ababaian?

May 17 '20 07:05 taltman

It's good and it's definitely working, but there are some features that need to be added; see the taxonomy issue.

May 17 '20 16:05 ababaian

Two questions:

Is @Bdegraaf1234 still maintaining this code?
Where is this code in the repo?

May 17 '20 22:05 taltman

Never mind the "where" part of the question; I looked at the commits.

May 17 '20 22:05 taltman

Probably not. We can merge and close this and have someone pick up from here.

May 17 '20 22:05 ababaian

I can take a crack at the metadata file, but it sounds like @rcedgar might be doing some of this with #101, so we should coordinate.

May 17 '20 22:05 taltman

Parsing genbank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.

May 17 '20 23:05 rcedgar