RGI bwt run time?

Open wichne opened this issue 5 years ago • 9 comments

Hello, I am running RGI on a metagenomic dataset. The execution looked something like $ rgi bwt -1 READ_ONE -2 READ_TWO -a bowtie2 -n 20 -o OUTPUT_FILE --include_wildcard

The size of the metagenome data is ~500M paired reads.

The process got through generation of the *.model_species_data_type.temp.txt file, and has now been grinding away for over 10 days on the next step (generation of the *.gene_mapping_data.txt file?)

Is this expected behavior? How long does this process take?

wichne avatar Sep 15 '20 18:09 wichne

@wichne I will test this and report back.

raphenya avatar Oct 07 '20 13:10 raphenya

@wichne I tested with compressed READ_ONE (223 MB) and READ_TWO (246 MB) files; it took about 20 minutes.

raphenya avatar Oct 08 '20 15:10 raphenya

For what it's worth, I've been having similar issues with metagenomic datasets, which have been making the tool unusable for the majority of my data (which has up to 100M read pairs in some cases). Happy to help debug this further.

bsiranosian avatar Dec 06 '20 16:12 bsiranosian

@bsiranosian @wichne I will write a script we can both use to test. I will add it here. Cheers.

raphenya avatar Dec 07 '20 16:12 raphenya

Hi all,

Our lab encountered something similar with some large WGS files (>100 M reads) when using BWT for wildcard detection (runtime > 48 hours). We were using 4 threads (which, to be fair, is not a ton), but increasing to 8 did not improve performance much.

After some light profiling, we suspect the hang-up/slowdown occurs when parallelizing the jobs and reading the .seqs.txt file for hits. Essentially, in get_reads_count(), csv.reader iterates over the entire seqs file and finds relevant hits. We tossed together some code which utilizes Dask and Pandas to pull hits at once (without parallelization of jobs), and it seems to produce equivalent outputs in a shorter time span on the same machine.
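To illustrate the kind of bottleneck described above, here is a minimal sketch of a single-pass tally. It is not RGI's actual get_reads_count(), and it is not the Dask/Pandas code mentioned in this comment. It assumes a hypothetical tab-delimited hits file whose first column is the reference ID; the real .seqs.txt layout may differ, and the function name count_reads_per_reference is made up for this example. The idea is that if a file is rescanned once per reference, the work grows with (references x rows), whereas one pass that counts every reference at once only grows with the number of rows.

```python
import csv
from collections import Counter


def count_reads_per_reference(seqs_path, key_column=0):
    """Tally rows per reference ID in a single pass over a tab-delimited file.

    Assumption: the reference ID sits in `key_column` (0 by default).
    This layout is hypothetical and may not match RGI's .seqs.txt file.
    """
    counts = Counter()
    with open(seqs_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank or short rows rather than raising IndexError.
            if len(row) > key_column:
                counts[row[key_column]] += 1
    return counts
```

After one call, looking up any reference is a dictionary access (for example counts["some_reference_id"]), so the file never needs to be read again for the next reference.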

Not sure if it'd be helpful to share (since it forces us to forego some other functionalities of the RGI tool) but hopefully it could provide some insight about potential bottlenecks. We're going to be in attendance during tomorrow's Q&A at CARD 2021 (and I believe we have some one-on-one time scheduled for questions) and would be more than happy to show how we circumvented the problem.

Best, John

jwframe28 avatar Feb 16 '21 22:02 jwframe28

@jwframe28 Yes! that's where the bottleneck is. I will write the code and take your suggestions into consideration. Cheers.

raphenya avatar Feb 17 '21 16:02 raphenya

Hi all, have there been any updates on this? I'd like to use RGI bwt for my data.

jasonarothman avatar Aug 24 '21 00:08 jasonarothman

Hi @raphenya, is it possible there are issues when running jobs on an HPC? When running RGI locally it seems to utilize the requested number of cores, but when running on our LSF cluster it doesn't seem to use more than one.

Edit: after upgrading from 5.2.1 to 6.0.0 I am not seeing the same behaviour: the intended number of threads are being utilized!

nickp60 avatar Sep 29 '22 14:09 nickp60

Re: is it possible there are issues when running jobs on an HPC?

@nickp60 I have run rgi bwt using the serial method outlined here: https://github.com/arpcard/rgi#running-rgi-on-compute-canada-serial-farm.

Re: Edit: after upgrading from 5.2.1 to 6.0.0 I am not seeing the same behaviour: the intended number of threads are being utilized!

Thank you. Please share your test scripts (if you can).

Currently, I'm using Python's cProfile to get performance stats and make plots. See below:


# generate pstats
python3 -m cProfile -o profile_rgi_bwt_bwa.pstats $(which rgi) bwt -1 10_R1.fastq.gz -2 10_R2.fastq.gz -o output1 --debug --clean --local -a bwa -n 10

python3 -m cProfile -o profile_rgi_bwt_kma.pstats $(which rgi) bwt -1 10_R1.fastq.gz -2 10_R2.fastq.gz -o output1 --debug --clean --local -a kma -n 10

python3 -m cProfile -o profile_rgi_main.pstats $(which rgi) main -i homolog.fasta -o out1 -n 10 --debug --clean --local > out1.log 2>&1

# make plots
gprof2dot profile_rgi_bwt_bwa.pstats | dot -Tpdf -o profile_rgi_bwt_bwa.pdf
gprof2dot profile_rgi_bwt_kma.pstats | dot -Tpdf -o profile_rgi_bwt_kma.pdf
gprof2dot profile_rgi_main.pstats | dot -Tpdf -o profile_rgi_main.pdf

But I don't like the plots, and I'm trying to simplify. If successful, we can add it as a tool the user can run.
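One possible simplification, offered only as a sketch and not as something RGI ships: the standard-library pstats module can read the same .pstats files and print a plain-text table of the most expensive functions, which avoids the graph rendering step. The helper name summarize_profile is hypothetical.

```python
import io
import pstats


def summarize_profile(pstats_path, top_n=15):
    """Return a text table of the top_n functions by cumulative time.

    `pstats_path` is a file written by `python3 -m cProfile -o ...`.
    """
    buffer = io.StringIO()
    stats = pstats.Stats(pstats_path, stream=buffer)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent
    # in a function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(top_n)
    return buffer.getvalue()
```

For example, print(summarize_profile("profile_rgi_bwt_bwa.pstats")) would list the top entries from the bwa profile generated above.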

raphenya avatar Oct 14 '22 15:10 raphenya

Hi,

I have encountered this bug, too. Are there any updates on the fix? What is the expected execution time for this procedure?

Best regards

davidecrs avatar Mar 14 '23 11:03 davidecrs

Issue is stale and will be closed in 7 days unless there is new activity

github-actions[bot] avatar Oct 17 '23 11:10 github-actions[bot]