padelpy
Padelpy GPU version?
Hello, I would like to know whether the library can run on a GPU. PaDELPy on CPU is quite slow: generating fingerprints for ~10,000 molecules takes me around 3-4 hours, sometimes even more. If a GPU version isn't available, how can the process be sped up? Please let me know. Thanks, Anjali
Hi @AnjaliSetiya,
The source code for PaDEL-Descriptor, while open source, is written in Java which, I will not lie, is not a language I have much experience with.
I'm going to leave this issue open, hopefully someone with more familiarity with Java and/or PaDEL-Descriptor's source code can chime in (and let us know if this is possible!).
Best, Travis
I don't know enough about GPU programming to accelerate this, and I think that would need to be done upstream in the actual PaDEL-Descriptor source code. What could be done here is to use Python's multiprocessing to divide the list of molecules across as many processes as possible. It won't get anywhere near the speedup of a true GPU implementation of the fingerprint calculation itself, but it should cut execution times quite substantially: there will be very little communication overhead, and I expect the speedup to scale linearly with the number of processes.
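The chunk-and-distribute idea can be sketched roughly as follows. This is a minimal sketch, not code from padelpy: `split_into_chunks`, `map_in_chunks`, and the `pool_cls` parameter are hypothetical names introduced for illustration, and the worker is assumed to take a list of molecules and return one result per molecule.

```python
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal lists."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_in_chunks(worker, items, n_procs, pool_cls=Pool):
    """Run worker(chunk) on each chunk in a pool and flatten the results.

    With the default multiprocessing.Pool, `worker` must be picklable,
    i.e. a function defined at module top level.
    """
    chunks = split_into_chunks(items, n_procs)
    with pool_cls(len(chunks)) as pool:
        per_chunk = pool.map(worker, chunks)
    # pool.map preserves chunk order, so the flattened list matches `items`.
    return [result for chunk_result in per_chunk for result in chunk_result]


# Hypothetical usage with padelpy (assumes from_smiles accepts a list of
# SMILES strings and returns one entry per molecule):
#
#     from padelpy import from_smiles
#     if __name__ == "__main__":
#         results = map_in_chunks(from_smiles, smiles, n_procs=4)
```

Each process only receives its chunk and sends back its results, which is why the communication overhead should stay small. Whether this actually helps depends on how many threads PaDEL-Descriptor already uses internally.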
Please let me know if this is of any interest and I can open a PR @tjkessler @AnjaliSetiya
Hi @JacksonBurns, please let me know what the PR would involve.
@AnjaliSetiya after further investigation I realized that padelpy actually has a passthrough to PaDEL that takes advantage of multiprocessing. This should buy you some huge speedups if you aren't using it already. See the example:
This code snippet takes about 3.5 minutes to run:
from padelpy import from_smiles

smiles = ['C' * 50] * 100
# One from_smiles call per molecule
for smi in smiles:
    from_smiles(smi)
whereas this takes only 11 seconds:
from padelpy import from_smiles

smiles = ['C' * 50] * 100
# A single from_smiles call with the whole list
from_smiles(smiles)
As far as a GPU version goes, I'm not sure that's really possible. I can't even find the source code to begin with, and on top of that, descriptor calculation consists of many short, 'bursty' computations that probably won't benefit much from a GPU. You could also consider looking at this reimplementation, which seems to be much faster. Another compelling option would be to use PaDEL directly, rather than through this Python wrapper, and save the output file to be read into Python later.
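For that last option, reading the saved output back into Python could look something like the sketch below. It assumes the descriptor file is a CSV with one row per molecule and a `Name` column identifying it; `read_descriptor_csv` and `_to_float` are hypothetical helper names, and the column name should be adjusted if your file differs.

```python
import csv


def _to_float(text):
    """Convert a CSV cell to float; empty or non-numeric cells become None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def read_descriptor_csv(path, name_column="Name"):
    """Return {molecule name: {descriptor name: float or None}}.

    Assumes a header row and one row per molecule, with the molecule
    identifier stored in `name_column`.
    """
    table = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            name = row.pop(name_column)
            table[name] = {key: _to_float(value) for key, value in row.items()}
    return table
```

This uses only the standard library; if pandas is already a dependency, `pandas.read_csv` would do the same job in one line.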