morecore error on _some_ large genomes
Hi,
I'm having an issue running compleasm (I guess it's really miniprot) on the following genomes:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964261635.1/ https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964204655.1/ https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964263255.1/
They are all newts with genomes of around 23-24 Gb and reasonable contiguity (N50 ~6-8 Mb).
The strange thing is I have run on similar, or even bigger genomes without issue, e.g.:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_026652325.1/ https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_040939525.1/ https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_040938575.1/
My solution at the moment has been to split into contigs. I first tried breaking the >2Gb scaffolds into 1Gb pieces, but this failed with the same error.
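For reference, the kind of splitting described above can be sketched as follows. This is only an illustration, not the script that was actually used: the function name `split_sequence` and the `name:start-end` naming scheme are assumptions, and the sequence is held in memory as a plain string for simplicity.

```python
from typing import Iterator, Tuple


def split_sequence(name: str, seq: str, max_len: int) -> Iterator[Tuple[str, str]]:
    """Yield (piece_name, piece_seq) chunks of at most max_len bases.

    Hypothetical helper: sequences no longer than max_len are passed
    through unchanged; longer ones are cut at fixed offsets.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(seq) <= max_len:
        yield name, seq
        return
    for start in range(0, len(seq), max_len):
        end = min(start + max_len, len(seq))
        # Encode 1-based inclusive coordinates in the piece name so that
        # hits can be mapped back to the original scaffold.
        yield f"{name}:{start + 1}-{end}", seq[start:end]


if __name__ == "__main__":
    # Toy example with a 10 bp "scaffold" cut into pieces of at most 4 bp.
    for piece_name, piece_seq in split_sequence("chr1", "ACGTACGTAC", 4):
        print(piece_name, piece_seq)
```

Cutting at fixed offsets can split a gene across two pieces, so a gene spanning a cut point may not be aligned as a whole; splitting at existing contig boundaries avoids this.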
Do you have any ideas? I notice that the miniprot included in compleasm_kit is v0.12. Was there any update for large chromosomes/genomes in v0.13?
Here is the error message for completeness:
```
[M::[email protected]*0.81] 189282232 blocks
[morecore] insufficient memory
Searching for miniprot in the path where compleasm.py is located
Searching for hmmsearch in the path where compleasm.py is located
miniprot execute command:
/scratch/brown/progs/compleasm_kit/miniprot
lineage: vertebrata_odb10
Traceback (most recent call last):
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2748, in <module>
    main()
    ~~~~^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2744, in main
    args.func(args)
    ~~~~~~~~~^^^^^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2624, in run
    mr.Run()
    ~~~~~~^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2147, in Run
    miniprot_output_path = self.miniprot_runner.run_miniprot(self.assembly_path,
        lineage_filepath,
        alignment_output_dir)
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 308, in run_miniprot
    raise Exception("miniprot exited with non-zero exit code: {}".format(exitcode))
Exception: miniprot exited with non-zero exit code: -6
```
Hi,
I think miniprot still has issues with large genomes. One potential solution is to do the mapping with the latest miniprot, e.g. `miniprot --trans -u -I --outs=0.95 --gff -t 8 ref-file protein.faa > output.gff`, and then use the compleasm analyze module to process the miniprot alignment result.
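If it helps with scripting this over many genomes, the miniprot step could be wrapped roughly like this. It is a minimal sketch: the flags are exactly those in the command above, while the function names `miniprot_command` and `run_miniprot` and their parameters are made up for illustration.

```python
import shlex
import subprocess
from typing import List


def miniprot_command(genome: str, proteins: str, threads: int = 8,
                     miniprot: str = "miniprot") -> List[str]:
    """Build the miniprot invocation quoted above as an argument list.

    `miniprot` is the path to the binary (assumed to be on PATH by default).
    """
    return [miniprot, "--trans", "-u", "-I", "--outs=0.95", "--gff",
            "-t", str(threads), genome, proteins]


def run_miniprot(cmd: List[str], out_gff: str) -> None:
    """Run miniprot and redirect its stdout to out_gff (like `> output.gff`)."""
    with open(out_gff, "w") as fh:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, stdout=fh, check=True)


if __name__ == "__main__":
    cmd = miniprot_command("ref-file", "protein.faa")
    print(shlex.join(cmd))  # shows the command without running it
```

The resulting GFF would then go to the compleasm analyze module; its options are not shown here, so check its help output for the exact arguments.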
Unfortunately that didn't work either. Do you know if there is a way to create the index separately? I tried playing around with the k-mer size and number of bits, but this still resulted in the same morecore error, just with different memory usage.
If you have already broken up the long chromosomes, I am not sure the crash was caused by the long-sequence bug in miniprot. It could be due to insufficient memory: during indexing miniprot may use roughly 10 times as much memory as the size of the input genome; alignment takes less memory.
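As a back-of-the-envelope check of that rule of thumb, here is a small sketch. The factor of 10 is the rough figure given above, not a measured value, and the calculation assumes about one byte per base, so a 24 Gb genome is treated as about 24 GB of input. The function names are hypothetical.

```python
def estimate_index_memory_gb(genome_size_gb: float, factor: float = 10.0) -> float:
    """Rough peak memory (GB) for indexing, using the ~10x rule of thumb."""
    return genome_size_gb * factor


def fits_in_memory(genome_size_gb: float, available_gb: float,
                   factor: float = 10.0) -> bool:
    """True if the rough estimate is within the memory available (GB)."""
    return estimate_index_memory_gb(genome_size_gb, factor) <= available_gb


if __name__ == "__main__":
    # A ~24 Gb newt genome would need on the order of 240 GB by this estimate.
    print(estimate_index_memory_gb(24))
    print(fits_in_memory(24, available_gb=128))
```

By this estimate a 24 Gb assembly would not fit on a 128 GB node, so it is worth comparing the estimate against the memory actually granted to the job before suspecting a bug.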
Anyway, try miniprot from GitHub HEAD. It is supposed to work with >2Gb chromosomes. Let me know how that works. Thanks.
This worked perfectly, thank you.
@tbrown91 VGP folks were talking about not classifying retrocopies as duplicates two weeks ago. Neng added the feature and released v0.2.7. With option --retrocopy, compleasm reports a new class "R" for a gene having one multi-exon copy and ≥1 single-exon copies. We haven't extensively tested the feature. Let us know if you are interested. Also note that v0.2.7 only works with the latest BUSCO database odb12.
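To make the "R" definition concrete, here is a toy sketch of the rule as stated above: one multi-exon copy plus at least one single-exon copy. It is not compleasm's actual implementation; the function `classify_copies`, its input (a list of exon counts, one per detected copy), and the "S"/"D" labels for the other cases are assumptions for illustration only.

```python
from typing import List


def classify_copies(exon_counts: List[int]) -> str:
    """Classify a gene from the exon counts of its detected copies.

    Illustrative only: "S" = single copy, "D" = duplicated,
    "R" = one multi-exon copy plus >=1 single-exon (retrocopy-like) copies.
    """
    if not exon_counts:
        raise ValueError("at least one copy is required")
    if len(exon_counts) == 1:
        return "S"
    multi_exon = sum(1 for n in exon_counts if n > 1)
    single_exon = sum(1 for n in exon_counts if n == 1)
    # Exactly one multi-exon copy and every other copy single-exon.
    if multi_exon == 1 and single_exon == len(exon_counts) - 1:
        return "R"
    return "D"


if __name__ == "__main__":
    print(classify_copies([3, 1, 1]))  # one multi-exon copy, two single-exon copies
    print(classify_copies([3, 4]))     # two multi-exon copies
```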
Thanks @lh3. Once I have the list of genomes, I would like to include this when generating compleasm scores for every genome. Do you know if it dramatically increases runtime/memory or affects the way the other classes are scored?
Hi @tbrown91,
The retrocopy option only affects the scoring of duplicated genes, not the other gene classes. However, because of differences between the odb12 and odb10 datasets, the logic for calculating the score of each class has changed slightly, so results from the latest version will differ slightly from those of previous versions. Runtime and memory are not affected.
Thanks @huangnengCSU I have tested the retrocopy mode on the VGP genomes. I'll try and present it at some point at one of the Friday meetings.
I also have a small feature request for the future: output the locations of the detected genes in GFF format. I think all of the necessary information is already in full_table.tsv, where each CDS could be extracted from the "Codons" column.
Thank you for being so responsive to my questions and gripes, I really appreciate it. I'm happy for you to close this "Issue" to tidy up your github if you would like.
Hi @tbrown91,
I have added GFF output for the detected genes. You can try it by running from the source code; no additional argument is needed.