Error when using -super5 with Muscle3D
Hi Robert!
First of all, thank you so much for the time you dedicate to software development and to making bioinformaticians' lives easier!
I'm trying to align around 90k PDBs from AlphaFold using Muscle 3D, running the following commands on a machine with 2 TB of RAM and 128 threads:
reseek -pdb2mega second_round/ -output second_round.mega && muscle -super5 second_round.mega -output second_round.afa
However, I always get the following error:
Mega::GetProfileByLabel(Cluster2) with different cluster numbers depending on the run.
Is -super5 compatible with Muscle3D? When I run -align with >1k sequences, I get the warning ">1k sequences, may be slow or use excessive memory, consider using -super5".
I also tried with smaller alignments (~200 PDBs) and got the same error. When I use -align instead of -super5, everything works nicely!
Is it advisable to run Muscle 3D with >90k sequences without the -super5 command? What should be the best strategy for this?
Thanks for your time, Mario
-super5 is for amino acid sequences; for structures, use -super7, which is briefly documented in the repo README as follows:
# for up to ~10,000 structures
reseek -convert STRUCTS -bca structs.bca
reseek -pdb2mega structs.bca -output structs.mega
reseek -distmx structs.bca -output structs.distmx
muscle -super7 structs.mega -distmxin structs.distmx -reseek -output structs.afa
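For anyone scripting the README steps quoted above, here is a minimal sketch of a wrapper that stops at the first failing step. The helper names `run` and `align_structs` and the `DRY_RUN` variable are hypothetical (not part of reseek or muscle); the flags are exactly the ones quoted from the README.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# align_structs STRUCTS PREFIX (hypothetical wrapper name).
# Chains the four README steps with && so that a failing step stops the rest.
align_structs() {
    structs=$1
    prefix=$2
    run reseek -convert "$structs" -bca "$prefix.bca" &&
    run reseek -pdb2mega "$prefix.bca" -output "$prefix.mega" &&
    run reseek -distmx "$prefix.bca" -output "$prefix.distmx" &&
    run muscle -super7 "$prefix.mega" -distmxin "$prefix.distmx" -reseek -output "$prefix.afa"
}
```

Running `DRY_RUN=1 align_structs second_round/ second_round` only prints the four commands, which is a cheap way to check the file names before starting a long job.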
I haven't tried 90k structures. I think there's a good chance it will work, though the alignment might be better if you cluster the structures first. Reseek has an undocumented clustering command; if you want to give that a try, let me know and I'll sketch out how to use it.
The usage message given by muscle does not explain this and the documentation at the web site does not mention structure at all yet -- the documentation could certainly be improved here.
If you find the 90k alignment useful, I'd be interested to learn more, maybe you could email me?
Hey @rcedgar, thank you so much for the super-fast reply!
Sorry, I completely missed that part of the README. I think I was too excited to try it out and jumped straight to launching it!
I will try again and let you know if it worked. For now, I'm interested in having the 90k structure-based alignment to compare it with a sequence-based alignment. If that doesn't work, I will try with reseek clustering first. I already compared Foldmason vs Muscle3D with a smaller set and Muscle worked way better for me.
I will definitely send you an email with more information about the project in case you are interested!
Best, Mario
Hi @rcedgar, I tried again with these commands:
reseek -convert second_round/ -bca second_round.bca
reseek -pdb2mega second_round.bca -output second_round.mega
reseek -distmx second_round.bca -output second_round.distmx -verysensitive
muscle -super7 second_round.mega -distmxin second_round.distmx -reseek -output second_round.afa
And I get the following error:
---Fatal error--- Distance matrix too sparse
This means that the structures are very divergent; I think it's unlikely you will be able to make a meaningful MStA here. Happy to discuss further if you send me an email.
Hi @rcedgar, I've checked and the distance matrix second_round.distmx is completely empty, it is only showing the IDs of the proteins. I will investigate more in depth what can be causing this issue.
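A quick way to catch this before launching muscle is to check whether the matrix file holds any numbers at all. The sketch below assumes the distance matrix is a plain-text file in which each record starts with a label and the values follow as whitespace-separated numeric fields. That layout is an assumption, not something documented in this thread, and `distmx_has_values` is a hypothetical helper name.

```shell
# distmx_has_values FILE (hypothetical helper).
# Succeeds if any line has a numeric field after the first (label) field.
# ASSUMPTION: labels come first and distances follow as whitespace-separated numbers.
distmx_has_values() {
    awk '
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) { found = 1; exit }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

For example, `distmx_has_values second_round.distmx || echo "distance matrix has no values" >&2` would flag a file that only lists protein IDs.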
I've sent you a mail with some more info!
This seems to be an issue with the conda build, close as resolved?
Yes, it was indeed an issue with the conda build. The same happened to other users.
Just as a comment in case anyone is facing the same, the latest version of reseek (v2.6.1) produces the same issue:
Running reseek -distmx structs.bca -output structs.distmx will produce an empty distance matrix
rcedgar/reseek#16