
Speedup suggestion during initial FASTQ decompression

Open edgardomortiz opened this issue 4 years ago • 5 comments

Thanks for developing GetOrganelle; it seems very complete and thorough. I am trying it on species of Ericaceae, and I hope it will handle the small repeats better than other software I have tried in the past (any tips to improve these assemblies are welcome).

However, during my initial tests on a Mac I noticed that it takes an excessive amount of time just to decompress the FASTQ files at the beginning (a ~5 GB file is taking more than 1.5 hours). My guess is that the combination of the Mac's head + gunzip is the reason; I have found that many of the Mac's own standard programs are really slow compared to their Linux versions. My suggestion would be to use Python's own gzip library to decompress and compress reads more quickly. If not, the BBTools suite (https://jgi.doe.gov/data-and-tools/bbtools/) also handles FASTQ files very fast, and random subsampling could be performed with its program reformat.sh.
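To illustrate the suggestion, here is a minimal sketch of taking the first N reads from a gzipped FASTQ with Python's standard gzip module instead of piping gunzip into head. This is not GetOrganelle's actual code: the function name `head_fastq_gz` and its parameters are hypothetical, and the sketch assumes standard 4-line FASTQ records.

```python
import gzip
from itertools import islice


def head_fastq_gz(in_path, out_path, n_reads):
    """Copy the first n_reads records of a gzipped FASTQ to an uncompressed file.

    Assumes each record spans exactly 4 lines (hypothetical helper, not part
    of GetOrganelle). Returns the number of complete records written.
    """
    n_lines = 0
    # Binary mode skips text decoding; lines are copied byte-for-byte.
    with gzip.open(in_path, "rb") as src, open(out_path, "wb") as dst:
        # islice stops after 4 * n_reads lines, so the rest of the
        # archive is never decompressed.
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            n_lines += 1
    return n_lines // 4
```

Because reading stops as soon as enough lines have been copied, the cost depends on the number of reads requested rather than on the size of the whole file. Whether this is actually faster than head + gunzip on a given Mac would need to be measured.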

Edgardo

edgardomortiz, Sep 25 '20 11:09

Hi Edgardo,

Thanks for using GetOrganelle and for the kind suggestion. I will carefully consider and test it.

As for Ericaceae, it will still be difficult with only Illumina data. I am developing another tool/function that uses long-read sequencing data for this. Hopefully it will be helpful if you have this kind of data.

Best, Jianjun

Kinggerm, Sep 26 '20 15:09

Hi, I ran GetOrganelle and got the following error:

GetOrganelle v1.7.1

get_organelle_from_reads.py assembles organelle genomes from genome skimming data. Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

Python 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
PYTHON LIBS: GetOrganelleLib 1.7.1; numpy 1.19.1; sympy 1.6.2; scipy 1.3.0; psutil 5.4.7
DEPENDENCIES: Bowtie2 /public/home/aaa/anaconda3/bin/bowtie2-align-s); SPAdes 3.13.0; Blast 2.9.0
LABEL DB: embplant_mt customized; embplant_pt customized
WORKING DIR: /public/home/aaa/project/01_tea/DASZ_mt/assemble
/public/home/aaa/anaconda3/bin/get_organelle_from_reads.py -s tea.mt.fasta -1 DASZ.R1.fastq.gz -2 DASZ.R2.fastq.gz -o DASZ_mt -R 50 -k 55,85,115,125,135 -F embplant_mt -t 6

2020-09-29 12:59:33,138 - INFO: Pre-reading fastq ...
2020-09-29 12:59:33,139 - INFO: Estimating reads to use ... (to use all reads, set '--reduce-reads-for-coverage inf')
2020-09-29 12:59:33,365 - INFO: Tasting 100000+100000 reads ...
2020-09-29 12:59:34,205 - ERROR: Traceback (most recent call last):
  File "/public/home/fafu_chenshuai/anaconda3/bin/get_organelle_from_reads.py", line 3750, in main
    random_seed=options.random_seed, verbose_log=options.verbose_log, log_handler=log_handler)
  File "/public/home/fafu_chenshuai/anaconda3/bin/get_organelle_from_reads.py", line 1014, in estimate_maximum_n_reads_using_mapping
    which_bowtie2=which_bowtie2)
  File "/public/home/fafu_chenshuai/anaconda3/lib/python3.7/site-packages/GetOrganelleLib/pipe_control_func.py", line 373, in map_with_bowtie2
    raise Exception("")
Exception

Total cost 26.55 s Please email [email protected] or [email protected] if you find bugs!

Sh1ne111, Sep 29 '20 06:09

I'm sorry, but your question is unrelated to this issue. Please open a separate issue. I will have to delete your question from this thread soon.

Kinggerm, Sep 29 '20 14:09

@Kinggerm Does GetOrganelle pull in pigz as well when installed via conda? If so, that would be a lot better, as pigz is foolishly fast!

harish0201, Apr 17 '21 16:04

@harish0201 That's true. But pigz is currently not required for non-conda installations. Incorporating it further needs more testing in different environments, though it is on my plan. Thanks for the suggestion.

Kinggerm, Apr 18 '21 11:04
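Since pigz is described above as optional for non-conda installations, one way such optional support could look is to use pigz only when it is found on PATH and fall back to Python's gzip module otherwise. The sketch below is an assumption for illustration, not GetOrganelle's implementation; the helper name `open_gz_lines` is hypothetical.

```python
import gzip
import shutil
import subprocess
from contextlib import contextmanager


@contextmanager
def open_gz_lines(path):
    """Yield a binary, line-iterable stream of a gzipped file's contents.

    Uses `pigz -dc` when pigz is on PATH; otherwise falls back to the
    standard gzip module. Hypothetical helper, not part of GetOrganelle.
    """
    pigz = shutil.which("pigz")
    if pigz is None:
        # Fallback: pure-Python decompression, no external dependency.
        with gzip.open(path, "rb") as handle:
            yield handle
        return
    # -d decompresses, -c writes the result to stdout.
    proc = subprocess.Popen([pigz, "-dc", path], stdout=subprocess.PIPE)
    try:
        yield proc.stdout
    finally:
        # Closing the pipe lets pigz exit even if the caller stopped early.
        proc.stdout.close()
        proc.wait()
```

A caller would iterate over the stream inside a `with open_gz_lines("reads.fastq.gz") as lines:` block and get the same bytes in both cases, so the rest of a pipeline would not need to know which decompressor was used.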