GetOrganelle icon indicating copy to clipboard operation
GetOrganelle copied to clipboard

unable to assemble Mitochondria genome for plant

Open amit4mchiba opened this issue 2 years ago • 4 comments

Dear Sir,

I am trying to assemble Mitochondria genome for my plant, and used this tool. I used paired-end dataset, and despite running for sometime, it ended with following message-

GetOrganelle v1.7.5.3

get_organelle_from_reads.py assembles organelle genomes from genome skimming data. Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] PLATFORM: Linux amit8riken 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 PYTHON LIBS: GetOrganelleLib 1.7.5.3; numpy 1.22.3; sympy 1.10.1; scipy 1.8.0; psutil 5.9.0 DEPENDENCIES: Bowtie2 2.4.5; SPAdes 3.13.0; Blast 2.12.0 GETORG_PATH=/home/amit8riken/.GetOrganelle SEED DB: embplant_mt 0.0.1; embplant_pt 0.0.1 LABEL DB: embplant_mt 0.0.1; embplant_pt 0.0.1 WORKING DIR: /mnt/HD1/Planrt_HiC_juicer /mnt/HD1/miniconda3/bin/get_organelle_from_reads.py -1 /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R1.fq.gz -2 /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R2.fq.gz -t 32 -o Plant_mitochrondia_output -R 50 -P 1000000 -k 21,65,85,105,115 -F embplant_mt

2022-04-05 08:54:01,586 - INFO: Pre-reading fastq ... 2022-04-05 08:54:01,587 - INFO: Estimating reads to use ... (to use all reads, set '--reduce-reads-for-coverage inf --max-reads inf') 2022-04-05 08:54:01,664 - INFO: Tasting 100000+100000 reads ... 2022-04-05 08:54:11,126 - INFO: Tasting 500000+500000 reads ... 2022-04-05 08:54:25,081 - INFO: Tasting 2500000+2500000 reads ... 2022-04-05 08:55:20,216 - INFO: Estimating reads to use finished. 2022-04-05 08:55:20,216 - INFO: Unzipping reads file: /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R1.fq.gz (57941448889 bytes) 2022-04-05 08:59:19,717 - INFO: Unzipping reads file: /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R2.fq.gz (62671269531 bytes) 2022-04-05 09:03:26,521 - INFO: Counting read qualities ... 2022-04-05 09:03:26,709 - INFO: Identified quality encoding format = Sanger 2022-04-05 09:03:26,709 - INFO: Phred offset = 33 2022-04-05 09:03:26,710 - INFO: Trimming bases with qualities (0.00%): 33..33 ! 2022-04-05 09:03:26,815 - INFO: Mean error rate = 0.0011 2022-04-05 09:03:26,816 - INFO: Counting read lengths ... 2022-04-05 09:06:15,479 - INFO: Mean = 243.0 bp, maximum = 250 bp. 2022-04-05 09:06:15,480 - INFO: Reads used = 75000000+75000000 2022-04-05 09:06:15,480 - INFO: Pre-reading fastq finished.

2022-04-05 09:06:15,480 - INFO: Making seed reads ... 2022-04-05 09:06:15,480 - INFO: Seed bowtie2 index existed! 2022-04-05 09:06:15,480 - INFO: Mapping reads to seed bowtie2 index ... 2022-04-05 09:23:42,553 - INFO: Mapping finished. 2022-04-05 09:23:42,553 - INFO: Seed reads made: Plant_mitochrondia_output/seed/embplant_mt.initial.fq (97248987 bytes) 2022-04-05 09:23:42,553 - INFO: Making seed reads finished.

2022-04-05 09:23:42,553 - INFO: Checking seed reads and parameters ... 2022-04-05 09:23:42,553 - INFO: The automatically-estimated parameter(s) do not ensure the best choice(s). 2022-04-05 09:23:42,553 - INFO: If the result graph is not a circular organelle genome, 2022-04-05 09:23:42,554 - INFO: you could adjust the value(s) of '-w'/'-R' for another new run. 2022-04-05 09:23:51,598 - INFO: Pre-assembling mapped reads ... 2022-04-05 09:24:00,890 - INFO: Pre-assembling mapped reads finished. 2022-04-05 09:24:00,891 - INFO: Estimated embplant_mt-hitting base-coverage = 262.96 2022-04-05 09:24:01,194 - INFO: Estimated word size(s): 182 2022-04-05 09:24:01,194 - INFO: Setting '-w 182' 2022-04-05 09:24:01,194 - INFO: Setting '--max-extending-len inf' 2022-04-05 09:24:01,468 - INFO: Checking seed reads and parameters finished.

2022-04-05 09:24:01,468 - INFO: Making read index ... 2022-04-05 09:41:30,393 - INFO: Mem 19.423 G, 140375132 candidates in all 150000000 reads 2022-04-05 09:41:30,393 - INFO: Pre-grouping reads ... 2022-04-05 09:41:30,393 - INFO: Setting '--pre-w 182' 2022-04-05 09:41:42,525 - INFO: Mem 19.462 G, 1000000/1801517 used/duplicated 2022-04-05 09:43:21,024 - INFO: Mem 23.609 G, 9551 groups made. 2022-04-05 09:43:32,086 - INFO: Making read index finished.

2022-04-05 09:43:32,087 - INFO: Extending ... 2022-04-05 09:43:32,087 - INFO: Adding initial words ... 2022-04-05 09:43:40,929 - INFO: AW 2334420 2022-04-05 10:06:31,843 - INFO: Round 1: 140375132/140375132 AI 3542144 AW 41665599 Mem 13.791 2022-04-05 10:37:54,472 - INFO: Round 2: 140375132/140375132 AI 13780016 AW 250916319 Mem 70.739 2022-04-05 11:08:15,300 - INFO: Round 3: 140375132/140375132 AI 24038205 AW 467503100 Mem 131.687 2022-04-05 11:34:35,412 - INFO: Round 4: 140375132/140375132 AI 29614451 AW 612356279 Mem 166.438 2022-04-05 11:58:50,123 - INFO: Round 5: 140375132/140375132 AI 33361451 AW 715776035 Mem 207.208 2022-04-05 12:21:18,711 - INFO: Round 6: 140375132/140375132 AI 36073641 AW 792925311 Mem 225.684 2022-04-05 12:43:09,339 - INFO: Round 7: 140375132/140375132 AI 38211205 AW 853555304 Mem 240.204 2022-04-05 13:04:26,094 - INFO: Round 8: 140375132/140375132 AI 39906414 AW 901552173 Mem 251.698 2022-04-05 13:25:21,373 - INFO: Round 9: 140375132/140375132 AI 41265788 AW 940384377 Mem 262.998 2022-04-05 13:46:09,032 - INFO: Round 10: 140375132/140375132 AI 42366998 AW 972083769 Mem 270.589 2022-04-05 14:06:38,300 - INFO: Round 11: 140375132/140375132 AI 43294165 AW 998624565 Mem 276.945 2022-04-05 14:26:57,499 - INFO: Round 12: 140375132/140375132 AI 44071700 AW 1020699697 Mem 282.232 2022-04-05 14:47:03,658 - INFO: Round 13: 140375132/140375132 AI 44714205 AW 1038932550 Mem 286.598 2022-04-05 15:07:00,164 - INFO: Round 14: 140375132/140375132 AI 45249884 AW 1054183042 Mem 290.25 2022-04-05 15:26:52,666 - INFO: Round 15: 140375132/140375132 AI 45706315 AW 1067029761 Mem 293.327 2022-04-05 15:46:39,385 - INFO: Round 16: 140375132/140375132 AI 46096296 AW 1077881805 Mem 295.973 2022-04-05 16:06:17,343 - INFO: Round 17: 140375132/140375132 AI 46417972 AW 1086885145 Mem 298.129 2022-04-05 16:25:51,320 - INFO: Round 18: 140375132/140375132 AI 46680487 AW 1094346943 Mem 299.916 2022-04-05 16:45:23,382 - INFO: Round 19: 140375132/140375132 AI 46905586 AW 1100702949 Mem 301.437 2022-04-05 17:04:49,515 - INFO: Round 20: 140375132/140375132 AI 47099903 AW 1106153781 Mem 302.743 2022-04-05 17:24:18,497 - INFO: Round 21: 140375132/140375132 AI 47265246 AW 1110789265 Mem 303.853 2022-04-05 17:43:46,256 - INFO: Round 22: 140375132/140375132 AI 47408678 AW 1114791655 Mem 304.812 2022-04-05 18:03:13,916 - INFO: Round 23: 140375132/140375132 AI 47535144 AW 1118304371 Mem 305.653 2022-04-05 18:22:41,898 - INFO: Round 24: 140375132/140375132 AI 47654202 AW 1121376481 Mem 306.389 2022-04-05 18:42:07,398 - INFO: Round 25: 140375132/140375132 AI 47755707 AW 1124032621 Mem 307.025 2022-04-05 19:01:30,718 - INFO: Round 26: 140375132/140375132 AI 47839948 AW 1126322919 Mem 307.574 2022-04-05 19:21:01,210 - INFO: Round 27: 140375132/140375132 AI 47910006 AW 1128257553 Mem 308.037 2022-04-05 19:40:29,485 - INFO: Round 28: 140375132/140375132 AI 47973193 AW 1129971213 Mem 308.448 2022-04-05 19:59:49,645 - INFO: Round 29: 140375132/140375132 AI 48028972 AW 1131480343 Mem 308.809 2022-04-05 20:19:08,517 - INFO: Round 30: 140375132/140375132 AI 48077714 AW 1132796447 Mem 309.124 2022-04-05 20:38:25,056 - INFO: Round 31: 140375132/140375132 AI 48119951 AW 1133943897 Mem 309.399 2022-04-05 20:57:40,610 - INFO: Round 32: 140375132/140375132 AI 48163382 AW 1135038245 Mem 309.661 2022-04-05 21:16:55,150 - INFO: Round 33: 140375132/140375132 AI 48198085 AW 1135945535 Mem 309.879 2022-04-05 21:36:13,553 - INFO: Round 34: 140375132/140375132 AI 48226392 AW 1136720047 Mem 310.064 2022-04-05 21:56:19,156 - INFO: Round 35: 140375132/140375132 AI 48252608 AW 1137422393 Mem 310.233 2022-04-05 22:15:30,537 - INFO: Round 36: 140375132/140375132 AI 48275331 AW 1138039187 Mem 310.38 2022-04-05 22:34:50,205 - INFO: Round 37: 140375132/140375132 AI 48295567 AW 1138582801 Mem 310.511 2022-04-05 22:54:09,040 - INFO: Round 38: 140375132/140375132 AI 48313894 AW 1139061003 Mem 310.625 2022-04-05 23:13:29,311 - INFO: Round 39: 140375132/140375132 AI 48329965 AW 1139482859 Mem 310.726 2022-04-05 23:32:50,933 - INFO: Round 40: 140375132/140375132 AI 48343519 AW 1139846421 Mem 310.813 2022-04-05 23:52:09,742 - INFO: Round 41: 140375132/140375132 AI 48355278 AW 1140162337 Mem 310.889 2022-04-06 00:11:28,310 - INFO: Round 42: 140375132/140375132 AI 48365117 AW 1140425177 Mem 310.952 2022-04-06 00:30:46,490 - INFO: Round 43: 140375132/140375132 AI 48374326 AW 1140661829 Mem 311.009 2022-04-06 00:50:03,379 - INFO: Round 44: 140375132/140375132 AI 48382714 AW 1140877253 Mem 311.06 2022-04-06 01:09:20,202 - INFO: Round 45: 140375132/140375132 AI 48390386 AW 1141074011 Mem 311.107 2022-04-06 01:28:38,593 - INFO: Round 46: 140375132/140375132 AI 48397155 AW 1141248209 Mem 311.149 2022-04-06 01:47:55,646 - INFO: Round 47: 140375132/140375132 AI 48403055 AW 1141400589 Mem 311.186 2022-04-06 02:07:15,718 - INFO: Round 48: 140375132/140375132 AI 48408252 AW 1141539029 Mem 311.219 2022-04-06 02:26:32,805 - INFO: Round 49: 140375132/140375132 AI 48412914 AW 1141656851 Mem 311.247 2022-04-06 02:45:50,889 - INFO: Round 50: 140375132/140375132 AI 48416782 AW 1141758977 Mem 311.271 2022-04-06 02:45:50,890 - INFO: Hit the round limit 50 and terminated ... 2022-04-06 02:57:59,610 - INFO: Extending finished.

2022-04-06 02:58:07,074 - INFO: Separating extended fastq file ... 2022-04-06 02:58:09,189 - ERROR: Traceback (most recent call last): File "/mnt/HD1/miniconda3/bin/get_organelle_from_reads.py", line 4072, in main reads_paired['pair_out'] = separate_fq_by_pair(out_base, options.prefix, verb_log, log_handler) File "/mnt/HD1/miniconda3/bin/get_organelle_from_reads.py", line 3327, in separate_fq_by_pair get_paired_and_unpaired_reads(input_fq_1=os.path.join(out_base, prefix + "extended_1.fq"), File "/mnt/HD1/miniconda3/lib/python3.9/site-packages/GetOrganelleLib/seq_parser.py", line 1649, in get_paired_and_unpaired_reads file_h_1_lines = open(input_fq_1, 'r').readlines() MemoryError

Total cost 65049.06 s For trouble-shooting, please Firstly, check https://github.com/Kinggerm/GetOrganelle/wiki/FAQ Secondly, check if there are open/closed issues related at https://github.com/Kinggerm/GetOrganelle/issues If your problem was still not solved, please open an issue at https://github.com/Kinggerm/GetOrganelle/issues please provide the get_org.log.txt and the assembly graph (can be *.png to protect your data privacy) if possible!

I used this command- get_organelle_from_reads.py -1 /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R1.fq.gz -2 /mnt/HD1/Planrt_HiC_juicer/Planrt_genome_related/Plant_ill_paired_R2.fq.gz -t 32 -o Plant_mitochrondia_output -R 50 -P 1000000 -k 21,65,85,105,115 -F embplant_mt

Please advise me. I tried this program on my Ubuntu server, program was installed using conda, version is GetOrganelle v1.7.5.3, 64 cores. I am not sure what other information I should provide, When this program ended, I got following files- ./ ├── extended_1.fq ├── extended_2.fq ├── get_org.log.txt └── seed ├── embplant_mt.initial.fq └── embplant_mt.initial.sam

1 directory, 5 files

I will be grateful from your help. Further, will it be possible to restart this program as It took quite some time to reach to this stage.

thanks you in advance for your help,

with best regards Amit

amit4mchiba avatar Apr 06 '22 18:04 amit4mchiba

Sorry for the delay. I just posted an update on FAQ https://github.com/Kinggerm/GetOrganelle/wiki/FAQ#memoryerror

I would first try using --memory-save and -w 195. BTW, your dataset looks difficult, with a huge dataset but limited relative coverage.

Kinggerm avatar Apr 08 '22 16:04 Kinggerm

Dear Kinggerm,

Thank you so much for your reply.

I tried following your suggestion and used following command- get_organelle_from_reads.py
-1 /mnt/HD1/Sophla_HiC_juicer/Plant_genome_related/Sf_ill_paired_R1.fq.gz
-2 /mnt/HD1/Sophla_HiC_juicer/Plant_genome_related/Sf_ill_paired_R2.fq.gz
-s ./Sf_mito_seeds1.fasta
-a ./Sf_chloro_seeds1.fasta --memory-save -w 195
-t 32 -o Sf_mitochrondia_output_9thApril22 -R 30 -k 21,65,85,105 -F embplant_mt

But I got the same error as before. I am not sure what limited coverage actually means. Actually, my plant genome size is 1.8Gb, and I got illumina sequencing worth 60x genome coverage with the objective to polish the genome post achieving pacbio based genome assembly. In the process, I wanted to also extract genomes of plastids and mitochrondia. I was able to get circular chromosome for plastids, but it keeps failing for mitochroindia. I thought that using seed and anti-seed will help, but i failed again.

Can something be done? Any suggestion? Do you recommend to increase the sequencing coverage?

thank you so much in advance,

with best regards Amit

amit4mchiba avatar Apr 09 '22 15:04 amit4mchiba

The absolute coverage of 262.96 is definitely enough for any organelle genomes, but it was estimated based on 37.5G of the 108G data, meaning that the ratio is very low. The advantage of GetOrganelle or any extending-based tools is based on the coverage differences among pt, mt and nuclear. The flatten the differences are, the less efficient the approach is.

Please attach the new log file with -w and --memory-save. The suggestions below will be changing upon the new log file and should not be taken seriously.


Anti-seed can be a doable strategy for this, but it is laborious and I did not make a user guide. For this data, what you need to do will be doing the following steps iteratively:

  1. Do a quick assembly with -R 2.
  2. Visualize the assembly graph in Bandage (along with the csv) and pick the mt/pt contigs and save them into mt_pt.fasta and save the nuclear contigs as the nucl.fasta. Discard uncertain contigs. Go to step one with mt_pt.fasta as the new seed and nucl.fasta as the new anti-seed.

Another flag that can be used simultaneously is "--max-extending-len". You may try 5000.

If you have finished the whole genome assembly with tools such as SPAdes, Velvet that could generate assembly graph file, you may use get_organelle_from_assembly.py to extract the target mt genome or (more likely) scaffolds from that file.

Kinggerm avatar Apr 09 '22 16:04 Kinggerm

Thank you so much for the advice. I am running what you suggested and will post you once result is ready. Thank you so much.

amit4mchiba avatar Apr 10 '22 16:04 amit4mchiba