[Feedback] gtt-gen-SCG-HMMs vs SCG sets wiki instruction output

Open naturepoker opened this issue 3 years ago • 1 comments

Hi, I thought you might like a feedback post for the SCG generator script vs the wiki instruction.

I just finished a test run of the gtt-gen-SCG-HMMs script versus the SCG instructions available on the wiki. It was on 49 genomes from the Deinococcus-Thermus phylum, so the sample size is pretty small. The hmmsearch step took the time shown below on a 10-year-old laptop, so it's pretty doable on student hardware.

real 85m36.039s
user 317m53.994s
sys 9m15.014s

I used Pfam 34.0 while following the wiki instructions, as opposed to Pfam 32.0. They might have switched the fields around a bit for the 34.0 release: I had to use field 34 instead of field 36 (cut -f 1,34 pfamA.txt > All_pfam_avg_covs.tsv) in the script to get the coverage info.
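For anyone hitting the same column shift, here is a minimal sketch of one way to check which field number to pass to cut. It prints every field of the first row with its 1-based index, so the coverage column can be spotted by eye. It assumes the file is tab-delimited (which the cut command above implies); the function name list_fields is made up for illustration and is not part of GToTree.

```shell
# Hypothetical helper (not part of GToTree): number the fields of the first
# row of a tab-delimited file, so the right index for `cut -f` can be found.
list_fields() {
    awk -F '\t' 'NR == 1 {
        for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i
        exit
    }' "$1"
}

# Example call, assuming pfamA.txt is in the current directory:
#   list_fields pfamA.txt
```

Once the index is known, it goes into the cut command in place of 34 or 36.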

The gtt-gen-SCG-HMMs script generated a final HMM file with 200 SCG Pfam accessions, and following the manual steps generated an HMM file with 193 SCG Pfam accessions. All 193 entries were present in the list of 200. I also ran the same test on a larger dataset of Halobacteria genomes and the results were similar: 216 SCG Pfam accessions from gtt-gen-SCG-HMMs versus 215 from the manual steps.
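A comparison like this can be sketched with sort and comm. The snippet below is only an illustration, not the exact commands used for the numbers above: it assumes each set has already been written out as a plain-text file with one accession per line, and the function name and file names are hypothetical.

```shell
# Hypothetical helper: print accessions that appear in the first list but not
# in the second. Both arguments are files with one accession per line.
only_in_first() {
    oif_a=$(mktemp) && oif_b=$(mktemp) || return 1
    # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
    LC_ALL=C sort -u "$1" > "$oif_a"
    LC_ALL=C sort -u "$2" > "$oif_b"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$oif_a" "$oif_b"
    rm -f "$oif_a" "$oif_b"
}

# Example call with made-up file names:
#   only_in_first script_accessions.txt manual_accessions.txt
```

Swapping the two arguments lists anything unique to the manual set. Empty output in that direction means the manual set is fully contained in the script's set.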

Let me know if I can provide files or other output for you. I also wrote up a more detailed description of the steps here: https://naturepoker.wordpress.com/2021/06/01/writing-the-tree-single-copy-genome-set-generation-with-pfam/

Thanks for the awesome program!

naturepoker • Jun 01 '21 05:06

Stellar! Thanks so much, @naturepoker! I wonder what’s causing the small discrepancy in totals at the end. Maybe it’s as simple as me doing something like > vs >= at some point... I’ll take a peek into this ASAP.
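To show the kind of boundary effect being guessed at here, a toy sketch: it counts how many rows of a two-column, tab-delimited file pass a cutoff under a strict versus an inclusive comparison. The column meaning, the cutoff value, and the function name are all made up for illustration; this is not the filtering logic of gtt-gen-SCG-HMMs.

```shell
# Hypothetical helper: count rows whose second column passes a cutoff.
#   $1 = tab-delimited file, $2 = cutoff, $3 = "gt" (strict) or "ge" (inclusive)
count_passing() {
    awk -F '\t' -v t="$2" -v op="$3" '
        (op == "gt" && $2 > t) || (op == "ge" && $2 >= t) { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example call with a made-up file and cutoff:
#   count_passing scores.tsv 90 gt
#   count_passing scores.tsv 90 ge
```

Any entry sitting exactly on the cutoff is counted by "ge" but not by "gt", so two otherwise identical pipelines could end up a few accessions apart.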

Thanks again for your info on your run and looking at this!

AstrobioMike • Jun 01 '21 05:06

Well, a lifetime later, I dug into this a bit but wasn't able to recreate the discrepancy. My best guess is that the gtt-gen-SCG-HMMs program was pulling from a later Pfam version than 34 at the time (since by default it takes the latest), as I was only able to get different numbers when using different Pfam versions. As of version 1.8.7, I modified gtt-gen-SCG-HMMs to at least report which version of Pfam is being used when it's making a new set 👍

Thanks again for your note on this regardless (and your blog post!), and for getting me to (eventually) at least capture the Pfam version used explicitly, @naturepoker :)

AstrobioMike • Sep 30 '24 01:09