Genbank parser dev
The genbankParser is designed to be run as a standalone script that generates a formatted and cleaned CSV table of the covid pan-genome from GenBank input. It handles most hostTaxonId mapping errors (which are plentiful) and attempts to infer hostTaxonIds for duplicate and highly homologous entries: if all entries in a cluster or duplicate set that provide a hostTaxonId agree on it, that value is inferred (in a new column) for the entries where none was provided. This infers about 1500 hostTaxonIds.
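For readers skimming the thread, the inference rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the function name `infer_host_taxon`, the column names `cluster_id` and `inferredHostTaxonId`, and the example records are all assumptions, and the taxon IDs are only illustrative values.

```r
# Sketch only: column and function names are hypothetical, not taken from the PR.
infer_host_taxon <- function(df) {
  host <- as.integer(df$hostTaxonId)

  # Group the provided IDs by cluster and keep a consensus value only when
  # every annotated member of the cluster reports the same single ID.
  groups <- split(host, df$cluster_id)
  consensus <- vapply(groups, function(ids) {
    known <- unique(ids[!is.na(ids)])
    if (length(known) == 1L) known else NA_integer_
  }, integer(1))

  # Write the consensus into a new column, only for rows that had no ID.
  # Assumption: rows that already have an ID get NA in the new column.
  inferred <- rep(NA_integer_, nrow(df))
  missing <- is.na(host)
  inferred[missing] <- consensus[as.character(df$cluster_id[missing])]
  df$inferredHostTaxonId <- inferred
  df
}

# Made-up example: cluster 1 agrees, cluster 2 conflicts, cluster 3 has no ID.
records <- data.frame(
  accession   = c("A1", "A2", "A3", "B1", "B2", "B3", "C1"),
  cluster_id  = c(1, 1, 1, 2, 2, 2, 3),
  hostTaxonId = c(9606L, NA, 9606L, 9606L, 9823L, NA, NA)
)
result <- infer_host_taxon(records)
```

In this sketch only "A2" receives an inferred value; the entry in the conflicting cluster and the entry in the cluster with no annotation are left empty rather than guessed.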
Didn't test, but functionality looks great, sorry again for treading on this earlier!
Minor suggestion for a possible future enhancement: much of the code is essentially a lookup table, which would be easier to maintain as an external file in (say) TSV format, e.g.
else if (grepl(fixed = FALSE, "Vespadelus baverstocki", NoParenth, ignore.case = TRUE)) {
  return("unclassified Scotoecus")
}
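One way the external-file idea might look, as a rough sketch: the column names `pattern` and `host_name`, the function `map_host_name`, and the inline TSV text are assumptions for illustration (in practice the table would presumably live in its own file and be read with `read.delim("path/to/file.tsv")`). The single row shown is the mapping quoted above.

```r
# Sketch only: the TSV layout and function name are hypothetical.
# Inline text stands in for an external .tsv file to keep the example self-contained.
lookup_tsv <- "pattern\thost_name
Vespadelus baverstocki\tunclassified Scotoecus"

host_lookup <- read.delim(text = lookup_tsv, stringsAsFactors = FALSE)

# Return the host name of the first pattern that matches, or NA if none does.
# Patterns are treated as case-insensitive regular expressions, mirroring
# grepl(fixed = FALSE, ..., ignore.case = TRUE) in the existing chain.
map_host_name <- function(name, lookup) {
  for (i in seq_len(nrow(lookup))) {
    if (grepl(lookup$pattern[i], name, ignore.case = TRUE)) {
      return(lookup$host_name[i])
    }
  }
  NA_character_
}

mapped <- map_host_name("vespadelus baverstocki", host_lookup)
```

Row order in the table would then play the role that branch order plays in the `else if` chain, so more specific patterns should come first.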
I fully agree; I'll try to implement it sometime in the next few days.
What is the status of this pull request? Are we waiting on a review from @ababaian ?
It's good; it's definitely working, but there are some features that still need to be added (see the taxonomy issue).
Two questions:
- Is @Bdegraaf1234 still maintaining this code?
- Where is this code in the repo?
Never mind the "where" part of the question; I looked at the commits.
Probably not. We can merge and close this, then have someone pick up from here.
I can take a crack at the metadata file, but it sounds like @r1cedgar might be doing some of this with #101, so we should coordinate.
Parsing genbank (this issue) and uniform annotation of reference and predicted genomes (#101) are separate issues. We want both in parallel.