DeepVirFinder
DeepVirFinder copied to clipboard
Prediction refactoring
Potential related issues
All issues related to long runtime: #21, #24, #27
Description
After spending a day figuring out why the software takes hours to finish on some sequences, I found several issues:
- Most of the software runtime is not linked to prediction but to pvalue calculation. This was previously done using pure python code and could be massively speed up using numpy.
- The multiple CPUs option python
multiprocessing
module sometimes does not work. An issue which might be related to numpy inference on CPU attribution. - The code is not well written and could be fixed in multiple ways.
Pull request description
Complete rewritting of the main dvf.py
script. This should produce exactly the same results than before. The results order might change however.
With one core and using the CRC test file, old version took 102 seconds while new one and in 29 seconds, meaning a 3.5 times improvement. I suspect the improvement to be actually much better than this since half of the time (~ 15 seconds) is used here to load models. New version of the software does not include the multiprocessing step.
I implement a random check that new pvalue equals old value every 1,000 predictions but this has never been trigger for all my tests.
I added a compare_results.py
script to compare results regardless of the order.
@jsgounot Hi, Thanks for the commit. I tried to run the modified dvf.py file. However, I get a segmentation fault error. Here is my code
python /home/soft/DeepVirFinder/dvf_mod.py
-i 0.1.assembly/0.3.merged.contigs/merged.contigs.fix.fasta
-o 0.3.metavirome/0.1.identify_vcontig/0.2.deepvir`
pid 148590's current affinity mask: ffffffffffffffffffffffffff pid 148590's new affinity mask: ff Using Theano backend. WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
- Loading Models. Model directory /home/soft/DeepVirFinder/models
- Encoding and Predicting Sequences. Predict for c_000000000001 [Segmentation fault (core dumped)`
Also ther is no -c
option for multithreading
HI @ShailNair, it's been a while since I did this and to be honest, I don't remember most of it. Regarding your segmentation fault, I guess this can be linked directly to your numpy and keras installation (my guess would be during the prediction part). One way to quickly check this would be to use multiple print statements. Concerning the multi threading, as I said in my original post, the previous implementation leads to error on my server which made me remove this part. I would be happy to answer some questions but please do not expect me to provide full support, I'm not one of the author of this tool and people behind this project completely ignored all the merge requests before.
@jsgounot Thanks for the quick reply.
Regarding your segmentation fault, I guess this can be linked directly to your numpy and keras installation (my guess would be during the prediction part). One way to quickly check this would be to use multiple print statements.
Yes. I will check regarding this.
I'm not one of the author of this tool and people behind this project completely ignored all the merge requests before.
That's unfortunate as multiple users have faced the speed issue with DeepVirFinder. I hope the developers @jessieren soon take an action.