Memcore error on _some_ large genomes

Open tbrown91 opened this issue 10 months ago • 9 comments

Hi,

I'm having an issue running compleasm (I guess it's really miniprot) on the following genomes:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964261635.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964204655.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964263255.1/

They are all newts with genomes around 23-24 Gb in size and OK contiguity (N50 ~6-8 Mb).

The strange thing is that I have run compleasm on similar, or even bigger, genomes without issue, e.g.:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_026652325.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_040939525.1/
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_040938575.1/

My workaround at the moment has been to split the scaffolds into contigs. I first tried breaking the >2Gb scaffolds into 1Gb pieces, but this failed with the same error.
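For anyone wanting to try the same workaround, here is a minimal sketch of splitting a scaffold into contigs at runs of `N`. The function name `split_scaffold`, the `_contigN` naming scheme, and the `min_gap` threshold of 10 are assumptions made for illustration, not something taken from compleasm or miniprot. It holds each scaffold in memory as a Python string, so for multi-gigabase scaffolds a streaming tool may be a better fit.

```python
import re
from typing import Iterator, Tuple


def split_scaffold(name: str, seq: str, min_gap: int = 10) -> Iterator[Tuple[str, str]]:
    """Yield (contig_name, contig_seq), cutting the scaffold at runs of >= min_gap Ns."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0   # start of the current contig within the scaffold
    index = 0   # running contig counter, used only for naming
    for match in gap.finditer(seq):
        if match.start() > start:  # skip empty pieces (e.g. a leading gap)
            index += 1
            yield f"{name}_contig{index}", seq[start:match.start()]
        start = match.end()
    if start < len(seq):           # trailing contig after the last gap
        index += 1
        yield f"{name}_contig{index}", seq[start:]
```

Note that any hits reported on the split sequences are in contig coordinates, so they would need to be shifted back if scaffold coordinates matter.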

Do you have any ideas? I notice that the miniprot included in compleasm_kit is v0.12. Was there any update for large chromosomes/genomes in v0.13?

Here is the error message for completeness:

```
[M::[email protected]*0.81] 189282232 blocks
[morecore] insufficient memory
Searching for miniprot in the path where compleasm.py is located
Searching for hmmsearch in the path where compleasm.py is located
miniprot execute command:
 /scratch/brown/progs/compleasm_kit/miniprot
lineage: vertebrata_odb10
Traceback (most recent call last):
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2748, in <module>
    main()
    ~~~~^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2744, in main
    args.func(args)
    ~~~~~~~~~^^^^^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2624, in run
    mr.Run()
    ~~~~~~^^
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 2147, in Run
    miniprot_output_path = self.miniprot_runner.run_miniprot(self.assembly_path,
                                                             lineage_filepath,
                                                             alignment_output_dir)
  File "/scratch/brown/progs/compleasm_kit/compleasm.py", line 308, in run_miniprot
    raise Exception("miniprot exited with non-zero exit code: {}".format(exitcode))
Exception: miniprot exited with non-zero exit code: -6
```
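A note on reading the exit code in the traceback: assuming compleasm reports the return code from Python's `subprocess` module, a negative value `-N` on POSIX systems means the child process was terminated by signal `N`. The small helper below (the name `describe_exit` is made up for this sketch) decodes it. On Linux, signal 6 is `SIGABRT`, which would be consistent with miniprot aborting right after the `[morecore] insufficient memory` message, although the thread does not confirm the exact code path.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable description."""
    if returncode < 0:
        # subprocess convention on POSIX: -N means "terminated by signal N"
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

Calling `describe_exit(-6)` on a Linux machine should name `SIGABRT` rather than an ordinary non-zero exit status.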

tbrown91, Feb 27 '25 10:02

Hi, I think miniprot still has an issue with large genomes. One potential solution is to do the mapping yourself with the latest miniprot, e.g. `miniprot --trans -u -I --outs=0.95 --gff -t 8 ref-file protein.faa > output.gff`, and then use the `compleasm analyze` module to process the miniprot alignment result.
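The two-step workflow suggested above could be scripted roughly as follows. This is a sketch under assumptions: the miniprot flags are copied from the comment, while the `compleasm analyze` flags (`-g`, `-o`, `-l`, `-t`) are taken from my reading of the compleasm README and should be checked against `compleasm analyze -h` for your version. All file paths and the helper names are placeholders.

```python
import shlex
import subprocess
from typing import List


def miniprot_cmd(genome: str, proteins: str, threads: int = 8) -> List[str]:
    """Build the miniprot command quoted in the comment above."""
    return ["miniprot", "--trans", "-u", "-I", "--outs=0.95", "--gff",
            "-t", str(threads), genome, proteins]


def analyze_cmd(gff: str, outdir: str, lineage: str, threads: int = 8) -> List[str]:
    """Build a `compleasm analyze` command (flag names are an assumption, see lead-in)."""
    return ["compleasm", "analyze", "-g", gff, "-o", outdir,
            "-l", lineage, "-t", str(threads)]


def run_two_step(genome: str, proteins: str, gff: str, outdir: str,
                 lineage: str, threads: int = 8) -> None:
    """Run miniprot, write its GFF to `gff`, then hand that file to compleasm."""
    with open(gff, "w") as out:
        subprocess.run(miniprot_cmd(genome, proteins, threads), stdout=out, check=True)
    subprocess.run(analyze_cmd(gff, outdir, lineage, threads), check=True)
```

Printing `shlex.join(miniprot_cmd("ref.fa", "protein.faa"))` before running anything is a cheap way to confirm the command line matches what you would type by hand.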

huangnengCSU, Feb 27 '25 15:02

Unfortunately that didn't work either. Do you know if there is a way to create the index separately? I tried playing around with the k-mer size and the number of bits, but this still resulted in a memcore error, just with different memory usage.

tbrown91, Mar 03 '25 09:03

If you have already broken up the long chromosomes, I am not sure the crash was caused by the long-sequence bug in miniprot. It could be due to insufficient memory, as miniprot may use ~10 times as much memory as the size of the input genome during indexing; alignment takes less memory.

Anyway, try miniprot from GitHub HEAD. It is supposed to work with >2Gb chromosomes. Let me know how that works. Thanks.
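As a back-of-the-envelope check of the ~10x figure mentioned above, the sketch below turns a genome size into a rough peak-memory estimate for indexing. It assumes roughly one byte per base, so a 24 Gb genome is treated as about 24 GB of input. The factor of 10 is only the rule of thumb from this comment, not a measured value, and the function names are invented for illustration.

```python
def estimated_indexing_memory_gb(genome_gb: float, factor: float = 10.0) -> float:
    """Rough peak memory (GB) during indexing: genome size times a rule-of-thumb factor."""
    return genome_gb * factor


def likely_fits(genome_gb: float, available_gb: float, factor: float = 10.0) -> bool:
    """True if the rough estimate is within the memory available on the node."""
    return estimated_indexing_memory_gb(genome_gb, factor) <= available_gb
```

Under these assumptions a 24 Gb newt genome would want on the order of 240 GB during indexing, so a node with much less memory than that could plausibly run out regardless of any long-sequence bug.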

lh3, Mar 07 '25 21:03

This worked perfectly, thank you.

tbrown91, Mar 10 '25 14:03

@tbrown91 VGP folks were talking about not classifying retrocopies as duplicates two weeks ago. Neng added the feature and released v0.2.7. With option --retrocopy, compleasm reports a new class "R" for a gene having one multi-exon copy and ≥1 single-exon copies. We haven't extensively tested the feature. Let us know if you are interested. Also note that v0.2.7 only works with the latest BUSCO database odb12.
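To make the "R" rule described above concrete, here is a purely illustrative sketch. It is not compleasm's implementation. It assumes each copy of a gene is summarised by its exon count, and the function name is hypothetical.

```python
from typing import Sequence


def is_retrocopy_class(exon_counts: Sequence[int]) -> bool:
    """Return True when a gene has exactly one multi-exon copy and at least one single-exon copy."""
    multi_exon = sum(1 for n in exon_counts if n > 1)
    single_exon = sum(1 for n in exon_counts if n == 1)
    return multi_exon == 1 and single_exon >= 1
```

For example, exon counts of `[5, 1, 1]` would satisfy the rule, while `[5, 4]` (two multi-exon copies) would not and would presumably still count as an ordinary duplicate.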

lh3, Mar 18 '25 22:03

Thanks @lh3. Once I have the list of genomes, I would like to include this when generating compleasm scores for every genome. Do you know if it dramatically increases runtime/memory or affects the way the other classes are scored?

tbrown91, Mar 21 '25 16:03

Hi @tbrown91,

The retrocopy option only affects the score of the duplicated genes and does not affect the scores of the other gene classes. However, because of differences between the odb12 and odb10 datasets, the logic for calculating the score of each class has changed a little, so results from the latest version will be slightly different from those of the previous version. Runtime and memory are not affected.

huangnengCSU, Mar 21 '25 16:03

Thanks @huangnengCSU, I have tested the retrocopy mode on the VGP genomes. I'll try and present it at some point at one of the Friday meetings.

I also have a small feature request for the future: outputting the locations of the detected genes in GFF format. I think all of the necessary information is already there in full_table.tsv, where each CDS could be extracted from the "Codons" column.
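As a rough idea of what such a conversion might look like at the gene level, here is a minimal sketch that turns one row of the table into a GFF3 `gene` line. The column names (`Gene`, `Status`, `Sequence`, `Gene Start`, `Gene End`, `Strand`) are assumptions about the full_table.tsv header and should be checked against a real file. The sketch deliberately does not parse the "Codons" column, whose format is not described in this thread.

```python
from typing import Dict


def row_to_gff3(row: Dict[str, str], source: str = "compleasm") -> str:
    """Format one table row (column names assumed, see lead-in) as a GFF3 gene feature."""
    # GFF3 requires start <= end, whatever the strand
    start, end = sorted((int(row["Gene Start"]), int(row["Gene End"])))
    attributes = f"ID={row['Gene']};status={row['Status']}"
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join([row["Sequence"], source, "gene", str(start), str(end),
                      ".", row["Strand"], ".", attributes])
```

Rows could be read with `csv.DictReader(handle, delimiter="\t")` and filtered on the status column before conversion, so that missing genes without coordinates are skipped.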

Thank you for being so responsive to my questions and gripes, I really appreciate it. I'm happy for you to close this "Issue" to tidy up your GitHub if you would like.

tbrown91, Apr 22 '25 09:04

Hi @tbrown91,

I have added GFF output for the detected genes. You can try it by running from the source code; no additional argument is needed.

huangnengCSU, May 05 '25 04:05