MACSE fails silently on large ORFs
hi there,
is there a maximum ORF length for MACSE? For one gene out of thousands in my dataset, MACSE gave me no output but also no errors. It turns out it is the Titin gene with an ORF of >30,000 amino acids (!!).
Example input seqs are human NM_001267550 and XM_028830889 (I used just the CDS seqs from those). I have attached the input file (had to add .txt on the end of the filename for github upload).
I'm using v2.07 and it has worked well with all my other input seqs. Here's my command:
java -jar -Xmx8000m macse_v2.07.jar -prog alignSequences -seq TTN_human_and_macaque.fa -out_NT TTN_human_and_macaque_NT.fa -out_AA TTN_human_and_macaque_AA.fa
It shows me this screen output:
file : TTN_human_and_macaque.fa
2 sequences with genetic code The_Standard_Code
But then stops and does not give me the usual output files.
I have tried it with two different java versions - 21.0.2 and 1.8.0_181. I'm working on #224-Ubuntu / x86_64
Thanks!
Janet Young, Fred Hutch Cancer Center, Seattle, WA, USA
Hi,
Thanks for using MACSE and for taking the time to describe your issue.
The TTN gene is kind of a monster in its own category...
The memory required to use MACSE is roughly proportional to the square of the nucleotide alignment length, so handling TTN requires quite a lot of memory and also has a strong impact on the computing time.
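To give a rough sense of what quadratic scaling means here, the sketch below compares TTN with an ordinary coding sequence. The 1,500 nt "typical" length is an assumption chosen purely for illustration, and the function name is hypothetical. The 90,000 nt figure is simply the >30,000 amino acid ORF multiplied by three. The result is a relative factor only, not an actual memory figure.

```python
def relative_memory_factor(length_nt: int, reference_length_nt: int) -> float:
    """Ratio of memory needs, assuming memory grows with the square of length."""
    if length_nt <= 0 or reference_length_nt <= 0:
        raise ValueError("lengths must be positive")
    return (length_nt / reference_length_nt) ** 2


ttn_nt = 30_000 * 3  # > 30,000 amino acids, so > 90,000 nucleotides
typical_nt = 1_500   # assumed length of an ordinary CDS (illustration only)

factor = relative_memory_factor(ttn_nt, typical_nt)
print(f"TTN needs roughly {factor:,.0f}x the memory of a {typical_nt} nt CDS")
```

Under these assumptions the factor is in the thousands, which is consistent with a heap size that works for every other gene being far too small for TTN.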
I tried to align TTN on our (former) cluster for the last release of OrthoMam, but with more than 100 species and the huge length of TTN, it failed. I was not sure it was worth the effort since it was a single gene.
So, the issue is most probably memory-related. Your example failed silently on my laptop after a few seconds, but I could get it running with the following options to increase memory:
java -Xmx48G -Xms48G -jar ~/soft/bin/macse_v2.07.jar -prog alignSequences -seq TTN_human_and_macaque.fa.txt
I interrupted the process after a few minutes because, with only two sequences, the alignment should be run with different frameshift/stop costs for one sequence ([explained here](https://github.com/ranwez/MACSE_V2_PIPELINES/issues/14)).
If you have more sequences, you can keep the default costs, but computing the alignment may take a long time. You should probably rely on the MACSE options that speed up the optimization process (-optim, -max_refine_iter, -local_realign_init, etc.), depending on your number of sequences and the time you're ready to invest. More information is in section 4 here.
If you really need this alignment, cannot obtain a correct result even with these tips, and are OK to share your raw sequences, I can try to align them on our HPC (you can send me a direct email to provide sequence access).
Whatever the reason for the failure, MACSE should provide an error message; I apologize for that.
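In the meantime, a small wrapper can catch this kind of silent failure when running thousands of genes. The Python sketch below is not part of MACSE, and the name `run_and_check` is hypothetical. It runs any command, then reports a non-zero exit code and any expected output file that is missing or empty. Whether MACSE returns a non-zero exit code in this situation is not established in this thread, so the file check is the part to rely on.

```python
import subprocess
from pathlib import Path


def run_and_check(cmd, expected_outputs):
    """Run cmd and return a list of problems (an empty list means none found)."""
    problems = []
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        problems.append(f"exit code {result.returncode}")
    for name in expected_outputs:
        path = Path(name)
        # A run that stopped early may leave no file, or an empty one.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(f"missing or empty output: {name}")
    return problems


# Example call, using the file names from the command in the original report:
# problems = run_and_check(
#     ["java", "-Xmx8000m", "-jar", "macse_v2.07.jar",
#      "-prog", "alignSequences", "-seq", "TTN_human_and_macaque.fa",
#      "-out_NT", "TTN_human_and_macaque_NT.fa",
#      "-out_AA", "TTN_human_and_macaque_AA.fa"],
#     ["TTN_human_and_macaque_NT.fa", "TTN_human_and_macaque_AA.fa"],
# )
# if problems:
#     print("MACSE run looks incomplete:", problems)
```

Checking for the output files rather than parsing screen output keeps the wrapper independent of how any particular MACSE version words its messages.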
Hope this feedback is helpful! Do not hesitate to reach out if you have further questions or issues with MACSE, or if you need help with this specific dataset.
Sincerely,
Vincent Ranwez
hi Vincent,
Thank you - this is very helpful! It's especially helpful that you pointed me towards advice about parameters for just two sequences.
I agree that emitting an error message would be useful - it took me a minute to figure out why I didn't have the right number of output files.
For my current purpose I am very happy to ignore TTN - I am getting a genome-wide sense of divergence between two species. I could also try increasing memory (our cluster has some big nodes) but it's probably not worth it.
all the best,
Janet
Confirming - with 48G of memory this worked fine on our cluster too. Thanks!