Error when using -super5 with Muscle3D

Open pentamorfico opened this issue 1 year ago • 7 comments

Hi Robert!

First of all, thank you so much for the time you dedicate to software development and to making bioinformaticians' lives easier!

I'm trying to align around 90k PDBs from AlphaFold using Muscle 3D, running the following commands on a machine with 2TB of RAM and 128 threads.

reseek -pdb2mega second_round/ -output second_round.mega && muscle -super5 second_round.mega -output second_round.afa

However, I always get the following error:

Mega::GetProfileByLabel(Cluster2) with different cluster numbers depending on the run.

Is -super5 compatible with Muscle3D? When I run -align with >1k sequences, I get the warning ">1k sequences, may be slow or use excessive memory, consider using -super5".

I also tried with smaller alignments (~200 PDBs) and I get the same error. When I use -align instead of -super5, everything works nicely!

Is it advisable to run Muscle 3D with >90k sequences without the -super5 command? What would be the best strategy for this?

Thanks for your time, Mario

pentamorfico avatar Nov 21 '24 18:11 pentamorfico

-super5 is for amino acid sequences; for structures use -super7, which is briefly documented in the repo README as follows:

# for up to ~10,000 structures
reseek -convert STRUCTS -bca structs.bca
reseek -pdb2mega structs.bca -output structs.mega
reseek -distmx structs.bca -output structs.distmx
muscle -super7 structs.mega -distmxin structs.distmx -reseek -output structs.afa
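
As a minimal sketch, the four README steps above could be wrapped in a shell function so that a failure in any step stops the run before muscle is invoked. The function name `run_super7` and its two arguments are hypothetical; only the reseek and muscle command lines themselves come from the README excerpt.

```shell
# Hypothetical wrapper around the four README commands above.
# Usage: run_super7 STRUCTS PREFIX
#   STRUCTS  input structures, as in the README example
#   PREFIX   basename used for the intermediate and output files
run_super7() {
    structs=$1
    prefix=$2

    # The && chain runs each step only if the previous one succeeded.
    reseek -convert "$structs" -bca "$prefix.bca" &&
    reseek -pdb2mega "$prefix.bca" -output "$prefix.mega" &&
    reseek -distmx "$prefix.bca" -output "$prefix.distmx" &&
    muscle -super7 "$prefix.mega" -distmxin "$prefix.distmx" -reseek -output "$prefix.afa"
}
```

For example, `run_super7 STRUCTS structs` would reproduce the file names used in the README excerpt.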

I haven't tried 90k structures. I think there's a good chance it will work, though the alignment might be better if you cluster the structures first. Reseek has an undocumented clustering command; if you want to give that a try, let me know and I'll sketch out how to use it.

The usage message given by muscle does not explain this and the documentation at the web site does not mention structure at all yet -- the documentation could certainly be improved here.

If you find the 90k alignment useful, I'd be interested to learn more, maybe you could email me?

rcedgar avatar Nov 21 '24 18:11 rcedgar

Hey @rcedgar, thank you so much for the super-fast reply!

Sorry, I completely missed that part of the README. I think I was too excited to try it out and jumped straight to launching it!

I will try again and let you know if it worked. For now, I'm interested in having the 90k structure-based alignment to compare it with a sequence-based alignment. If that doesn't work, I will try with reseek clustering first. I already compared Foldmason vs Muscle3D with a smaller set and Muscle worked way better for me.

I will definitely send you an email with more information about the project in case you are interested!

Best, Mario

pentamorfico avatar Nov 21 '24 18:11 pentamorfico

Hi @rcedgar I tried again following these commands:

reseek -convert second_round/ -bca second_round.bca
reseek -pdb2mega second_round.bca -output second_round.mega
reseek -distmx second_round.bca -output second_round.distmx -verysensitive
muscle -super7 second_round.mega -distmxin second_round.distmx -reseek -output second_round.afa

And I get the following error:

---Fatal error--- Distance matrix too sparse

pentamorfico avatar Nov 22 '24 09:11 pentamorfico

This means that the structures are very divergent; I think it's unlikely you will be able to make a meaningful MStA here. Happy to discuss further if you send me an email.

rcedgar avatar Nov 22 '24 14:11 rcedgar

Hi @rcedgar, I've checked and the distance matrix second_round.distmx is completely empty: it only shows the IDs of the proteins. I will investigate in more depth what could be causing this issue.
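
For anyone wanting to check for the same symptom, here is a minimal sketch. It assumes the distance matrix is whitespace-separated text in which each line starts with a label and any distances appear as numeric fields after it; the actual reseek output format is not described in this thread, so treat that as an assumption. The helper name `count_numeric_rows` is hypothetical.

```shell
# Hypothetical helper: count the lines of a text file that contain at least
# one numeric field after the first (label) field.
# Assumes whitespace-separated text; this is NOT a documented reseek format.
count_numeric_rows() {
    awk '{
        for (i = 2; i <= NF; i++) {
            # Accept plain, decimal, and exponent notation (e.g. 3, 0.25, 1e-3).
            if ($i ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) { n++; break }
        }
    } END { print n + 0 }' "$1"
}
```

Under those assumptions, a result of 0 from `count_numeric_rows second_round.distmx` would correspond to the labels-only file described above.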

I've sent you a mail with some more info!

pentamorfico avatar Nov 27 '24 17:11 pentamorfico

This seems to be an issue with the conda build, close as resolved?

rcedgar avatar Nov 28 '24 16:11 rcedgar

Yes, it was indeed an issue with the conda build. The same happened to other users.

Just as a comment in case anyone is facing the same problem: the latest version of reseek (v2.6.1) produces the same issue:

Running reseek -distmx structs.bca -output structs.distmx will produce an empty distance matrix

rcedgar/reseek#16

pentamorfico avatar Aug 15 '25 22:08 pentamorfico