PPanGGOLiN
Getting MSAs for single-copy gene families when duplicates are tolerated
I'm looking at the code for writing FastA files for computing MSAs here. I'm trying to understand what happens to the multi-copy members of a genome in single-copy families that are defined while tolerating multiple copies. From the code, it seems that such multi-copy members are not printed to FastA at all, so they appear to be lost if one only looks at the output FastA file. Could you please confirm this? Sorry if I have misunderstood something. I appreciate that it could get tricky to figure out whether the multi-copy members are fragments of the same gene. Thank you.
Hi,
I can confirm that this is indeed what is done here. Treating genes identified as fragments differently when they align to different parts of the "complete" gene could be an improvement to this function!
Adelme
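To make the confirmed behaviour concrete, here is a minimal sketch, assuming a family is represented as a mapping from genome name to the list of its gene members. This is not PPanGGOLiN's code: the function names `single_copy_members` and `to_fasta` and the toy data are hypothetical.

```python
# Hypothetical illustration, not PPanGGOLiN's actual implementation.

def single_copy_members(genes_by_genome):
    """Keep one gene per genome, skipping genomes with several copies."""
    kept = {}
    for genome, genes in genes_by_genome.items():
        if len(genes) == 1:  # genomes with two or more copies contribute nothing
            kept[genome] = genes[0]
    return kept


def to_fasta(kept, sequences):
    """Format the retained genes as FASTA text, one record per genome."""
    return "".join(
        f">{genome}\n{sequences[gene]}\n" for genome, gene in sorted(kept.items())
    )


# Toy family: genomeB carries two members (possibly fragments of one gene).
family = {"genomeA": ["a1"], "genomeB": ["b1", "b2"], "genomeC": ["c1"]}
sequences = {"a1": "ATGAAA", "b1": "ATGAAC", "b2": "ATGA", "c1": "ATGAAG"}

fasta_text = to_fasta(single_copy_members(family), sequences)
print(fasta_text)
```

In this sketch genomeB is absent from the output, which is why its members look "lost" when only the FastA file is inspected.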
Thanks, Adelme.
A related matter: I noticed that the parameter description for dup_margin here doesn't seem quite right.
I think that it should be the same as the parameter description here.
Also, isn't a default value of 0.95 for dup_margin a bit high for determining whether a gene family is single-copy or not?
Indeed, the description is wrong. And actually, the second function that you pointed to should probably be used instead of the code currently in use in writeMSA.
And indeed the default value is way off, but thankfully it should never be used: the command-line default, a much more suitable 0.05, seems to be in use in all of the function calls.
Both of those do deserve a fix; thank you for reporting this.
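For readers wondering why 0.95 is too permissive, here is a small sketch of a dup_margin-style test. It assumes that dup_margin is compared against the fraction of genomes carrying more than one member of the family, and that the comparison is strict. Both points are assumptions for illustration, and `is_single_copy` as written here is a hypothetical helper, not the library's function.

```python
# Hypothetical illustration of a dup_margin-style threshold.

def is_single_copy(genes_by_genome, dup_margin):
    """Return True if the share of genomes with duplicates is below dup_margin."""
    if not genes_by_genome:
        raise ValueError("family has no genomes")
    n_multi = sum(1 for genes in genes_by_genome.values() if len(genes) > 1)
    dup_ratio = n_multi / len(genes_by_genome)
    return dup_ratio < dup_margin  # strict comparison is an assumption


# Toy family: 20 genomes, 2 of which carry two copies (duplication ratio 0.1).
toy_family = {f"genome{i}": [f"gene{i}"] for i in range(18)}
toy_family["genome18"] = ["gene18a", "gene18b"]
toy_family["genome19"] = ["gene19a", "gene19b"]

print(is_single_copy(toy_family, dup_margin=0.95))  # tolerant threshold
print(is_single_copy(toy_family, dup_margin=0.05))  # strict threshold
```

Under these assumptions, a margin of 0.95 would accept a family duplicated in almost every genome, whereas 0.05 rejects the toy family above because 10% of its genomes carry duplicates.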