DeepVirFinder icon indicating copy to clipboard operation
DeepVirFinder copied to clipboard

Prediction refactoring

Open jsgounot opened this issue 2 years ago • 3 comments

Potential related issues

All issues related to long runtime: #21, #24, #27

Description

After spending a day figuring out why the software takes hours to finish on some sequences, I found several issues:

  • Most of the software runtime is not linked to prediction but to pvalue calculation. This was previously done using pure python code and could be massively speed up using numpy.
  • The multiple CPUs option python multiprocessing module sometimes does not work. An issue which might be related to numpy inference on CPU attribution.
  • The code is not well written and could be fixed in multiple ways.

Pull request description

Complete rewritting of the main dvf.py script. This should produce exactly the same results than before. The results order might change however.

With one core and using the CRC test file, old version took 102 seconds while new one and in 29 seconds, meaning a 3.5 times improvement. I suspect the improvement to be actually much better than this since half of the time (~ 15 seconds) is used here to load models. New version of the software does not include the multiprocessing step.

I implement a random check that new pvalue equals old value every 1,000 predictions but this has never been trigger for all my tests.

I added a compare_results.py script to compare results regardless of the order.

jsgounot avatar Jan 05 '22 04:01 jsgounot

@jsgounot Hi, Thanks for the commit. I tried to run the modified dvf.py file. However, I get a segmentation fault error. Here is my code

python /home/soft/DeepVirFinder/dvf_mod.py
-i 0.1.assembly/0.3.merged.contigs/merged.contigs.fix.fasta
-o 0.3.metavirome/0.1.identify_vcontig/0.2.deepvir`

pid 148590's current affinity mask: ffffffffffffffffffffffffff pid 148590's new affinity mask: ff Using Theano backend. WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

  1. Loading Models. Model directory /home/soft/DeepVirFinder/models
  2. Encoding and Predicting Sequences. Predict for c_000000000001 [Segmentation fault (core dumped)`

Also ther is no -c option for multithreading

ShailNair avatar Sep 19 '22 01:09 ShailNair

HI @ShailNair, it's been a while since I did this and to be honest, I don't remember most of it. Regarding your segmentation fault, I guess this can be linked directly to your numpy and keras installation (my guess would be during the prediction part). One way to quickly check this would be to use multiple print statements. Concerning the multi threading, as I said in my original post, the previous implementation leads to error on my server which made me remove this part. I would be happy to answer some questions but please do not expect me to provide full support, I'm not one of the author of this tool and people behind this project completely ignored all the merge requests before.

jsgounot avatar Sep 19 '22 02:09 jsgounot

@jsgounot Thanks for the quick reply.

Regarding your segmentation fault, I guess this can be linked directly to your numpy and keras installation (my guess would be during the prediction part). One way to quickly check this would be to use multiple print statements.

Yes. I will check regarding this.

I'm not one of the author of this tool and people behind this project completely ignored all the merge requests before.

That's unfortunate as multiple users have faced the speed issue with DeepVirFinder. I hope the developers @jessieren soon take an action.

ShailNair avatar Sep 19 '22 05:09 ShailNair