RMSD with multicore and timeout functionality
As we discussed in #105: here is my first shot at an implementation of the RMSD functions with multicore and timeout functionality.
For now I kept the `rmsd_timeout` function public, but we could also just make it private. This function supports a molref with multiple mols, something the multicore function doesn't support directly (everything is done pairwise there). Although I guess there is an easy workaround for the user?
Maybe it's even better to implement this ourselves: if only one mol is supplied as the ref mol, extend the list to match the length of the mol list (like this: `[refmol]*len(mols)`). This would also give each molecule a separate timeout. Having the possibility of a mol list in the `rmsd_timeout` function also made me return the raw result of `rmsdwrapper` (a list) instead of the value (because otherwise only one value would be returned). So if we make it private, it could clean things up and also emulate the regular `rmsdwrapper` functionality better! (I just came up with this while writing this description; if you like it, I will implement it tomorrow!)
I also still need to add tests. I guess we could add a test for a timeout, a test for matching mols and molrefs lists, ... But multicore might be harder to test?
Any feedback welcome as always!
Note: I intentionally kept all the new functions and imports together at the bottom so it's clearer to see what's added. This means Flake8 doesn't pass. MyPy doesn't pass yet either, but I'll get to fixing that tomorrow as well!
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.10%. Comparing base (da6be7f) to head (10a47be). Report is 14 commits behind head on develop.
@Jnelen sorry, I clicked "Update Branch" without thinking too much. You'll have to pull before pushing further changes, otherwise you will get an error. Sorry about that!
For the `Value` ctype I picked a regular single-precision float. However, we could also use a double-precision float if you think that is more appropriate!
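For reference, a minimal sketch of such a shared value, assuming it's created with `multiprocessing.Value` (not necessarily the exact code in the PR):

```python
import multiprocessing as mp

# "f" is a single-precision C float; "d" would make it double precision.
shared_rmsd = mp.Value("f", float("nan"))

# A worker process would store its result like this:
with shared_rmsd.get_lock():
    shared_rmsd.value = 1.2345
```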
Thanks! Please don't waste too much time on `mypy` issues. It's supposed to be helpful, not annoying. If it becomes too much, we can simply turn it off for the offending line(s) and deal with it another time.
I just suppressed the 2 remaining MyPy errors. One of them should hopefully get fixed by MyPy itself (see the earlier issue I referred to); the other is that `os.cpu_count()` can sometimes return `None`, which causes problems when using `min(num_workers, os.cpu_count())`. I tried your patch, but it was still complaining.
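One way to sidestep that complaint is to give the comparison a fallback (a general sketch of the pattern, not necessarily the patch that was tried):

```python
import os

requested_workers = 4  # whatever the user asked for

# os.cpu_count() is typed as Optional[int] and may return None on some
# platforms, so fall back to 1 before taking the minimum.
num_workers = min(requested_workers, os.cpu_count() or 1)
```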
Functionally I think it's basically how we want it, but perhaps some parts of the implementation can still be cleaned up? Besides that, I suppose some tests should be added as well? The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place). But I'm not sure how to properly test the multicore feature?
By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!
Besides that, I suppose some tests should be added as well?
Yes, we need tests. =)
The timeout feature should be relatively easy to test (by using one of the "problematic" compounds that made me start looking into this in the first place).
Yes. I think the worst one you reported is a great candidate.
But I'm not sure how to properly test the multicore feature?
I think GitHub Actions have access to multiple cores, but I should double check. In any case, the test would be on correctness.
Eventually we could also add a benchmark test, but it's not a priority IMO. If you have some results from your work (where you see an improvement), you can just post the timings here for future reference. The important thing is that it's correct.
By the way, there is no rush in getting this finished quickly, so please only review it if you actually have the time!
Thanks. You are doing great work, so I don't want it to stall. ;)
I added some very basic tests for the `prmsdwrapper` function (I basically copied the tests for the `rmsdwrapper` function and used `prmsdwrapper` instead). I also added a test for the timeout function. However, one thing I noticed is that for some reason each test runs twice: once with a True and once with a False variable. When the tests were giving errors I got this:
FAILED tests/test_rmsd.py::test_prmsdwrapper_single_molecule_timeout[True] - assert nan is nan
FAILED tests/test_rmsd.py::test_prmsdwrapper_single_molecule_timeout[False] - assert nan is nan
I fixed the error of the assert, but the test still runs twice. I'm still confused since there isn't a parametrize, or even an input variable. I couldn't immediately figure it out, but I assume it's getting this from some earlier statement?
Of course there is still some work left to improve the actual prmsdwrapper function, but I suppose some basic tests can also add sanity checks to ensure we're not breaking stuff.
Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.
I'm still confused since there isn't a parametrize, or even an input variable. I couldn't immediately figure it out, but I assume it's getting this from some earlier statement?
Yes, this is a bit confusing. I think the two runs are coming from the following fixture, which is automatically applied to all the tests in the file (`autouse=True`):
https://github.com/RMeli/spyrmsd/blob/08764e1c3a9eb4b636c14ddf083e1cbbe50014f9/tests/test_rmsd.py#L11-L15
It's used to trigger a separate code path deep inside the minimized RMSD calculation. So the two runs are normal and you should not worry about them, but it's great that you look this closely at what's happening!
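For context, a parametrized `autouse` fixture makes every test in the module run once per parameter, roughly like this generic sketch (the real fixture is in the linked lines):

```python
import pytest


@pytest.fixture(autouse=True, params=[True, False])
def toggle_code_path(request):
    """Applied to every test in this module; each test runs once per param."""
    return request.param


def test_example(toggle_code_path):
    # Collected twice: test_example[True] and test_example[False]
    assert toggle_code_path in (True, False)
```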
Thanks @Jnelen! I'm a bit swamped this week too, but I'll try to have a look and reply ASAP. Sorry for the delays.
No problem at all. Actually, I will probably also be quite inactive until Wednesday next week. I should still be able to respond to some comments, but I probably won't be able to change or test any new code until then. Just as a heads-up!
Ah that explains it, thanks for the clarification!
I'm back from my break! Right before that I actually updated my use case (ESSENCE-Dock) to work with a timeout and multicore, but I ended up using Pebble since it handles both at the same time. For my specific use case I think it was the most elegant solution, and I use a Singularity image anyway, so I didn't mind adding an additional dependency.
However, this got me thinking about whether we should change the current multicore and timeout implementation. Maybe I had my (specific) use case too much in mind? It is definitely functional and will work just fine to process a large number of molecules across different CPUs. However, the calculation of graph isomorphisms isn't currently accelerated; for that I assume we need to look into solutions specific to the different backends (since I assume they can take care of the multicore support)? I could also see how that could lead to problems with the current multicore implementation, since it would want to use multiple cores in each process. That's why I'm wondering if we should make larger changes to the current version.
Finally, as for a benchmark, I haven't run a proper one yet, but I'll try to run one later with the current implementation.
Hi,
I finally got around to working on a simple benchmark, but I have started to notice quite a notable difference between running the calculations the first time versus subsequent times. Here is a concrete example for the same dataset. Running it the first time:
real 0m30.536s
user 0m17.145s
sys 0m2.811s
Versus the second (and subsequent) time(s):
real 0m11.677s
user 0m11.957s
sys 0m2.657s
Do you know why this is? I'm not sure, but I think it's because of the loading of the molecules, and that in subsequent runs these are maybe cached? Also note the discrepancy between the real time and the user time in the first run. I tried several different datasets, and this pattern seems to be very common. Any idea what could be happening?
I will try to follow up later with some concrete "benchmarks" and more context!
Alright, I did some testing, and honestly the performance of the parallel implementation is a lot worse than I thought it would be. In the edge-case tests I was doing it seemed good, because even if some "problematic" molecules were present it would still finish in a reasonable amount of time, whereas the "regular" implementation would get killed or take a very long time. However, in more "normal" scenarios the current prmsdwrapper seems to degrade the performance by a lot. I used some docking examples from DUD-E to test. This example is for HS90A, where I used the LeadFinder (LF) and Gnina (GN) results to calculate the RMSD for 4938 compounds. You can find the scripts I used here: benchmark_scripts.zip.
Anyway, here are my numbers:
- Regular (`rmsdwrapper`): real 0m11.990s, user 0m12.207s, sys 0m2.473s
- Parallel (1 core): real 2m56.928s, user 0m25.358s, sys 2m39.194s
- Parallel (4 cores): real 0m51.843s, user 0m24.107s, sys 2m45.873s
- Parallel (8 cores): real 0m29.346s, user 0m23.753s, sys 2m40.650s
- Parallel (16 cores): real 0m18.365s, user 0m21.712s, sys 2m37.021s
Something else I noticed is that the error handling in `prmsdwrapper` should be improved as well. With the current implementation it seems to almost never really be worth it to use `prmsdwrapper`, except when there are many edge cases that can take a very long time or even kill the Python process. However, at that point it might be better for the user to implement specific changes/checks themselves? I think our concept of `prmsdwrapper` is still a good idea, but my implementation seems to perform very suboptimally. I would love to have a clean and efficient `prmsdwrapper` function merged, but this definitely isn't it. I also have many other projects coming up, so unfortunately I think I will have less time to allocate to this as well. What do you think?
Hi @Jnelen
but I ended up using Pebble since this handles both at the same time. For my specific use-case I think it was the most elegant solution, and I use a Singularity image anyway, so I didn't mind adding an additional dependency.
I ended up looking into `pebble` too during my investigation, since it seems to do what we need. I wouldn't mind adding an additional (optional!) dependency if you think it's the most elegant solution.
However the calculation of graph-isomorphisms isn't currently accelerated, but for that I assume we need to look into specific solutions for the different backends (since I assume they can take care of the multicore support)?
Acceleration of graph isomorphisms is up to the backend, but as you mentioned previously `graph-tool` did not seem to use multiple processes for it. I also looked into `cugraph`, and it's missing there too, so I'm not sure where a parallel implementation is currently available. In any case, I think your use case (virtual screening) is common enough to warrant parallelism over molecules.
I finally got around to work on a simple benchmark, but I have started to notice quite a notable difference between running the calculations the first time versus subsequent times.
Off the top of my head I have no idea why this happens; it depends how you run the benchmark. Maybe it's something related to `__pycache__`?
However, in more "normal" scenarios, the current
prmsdwrapper
seems to degrade the performance by a lot.
Spawning threads/processes has some overhead. I'm wondering if the fact that we do this (plus the creation of the shared-memory result) every time `_rmsd_timeout` is called adds an unacceptable overhead. However, I looked at your benchmarks, and if you are measuring the total time of the script, I think you are also accounting for the I/O and the preparation of the lists. A better comparison would be to time only `prmsdwrapper` and the loop over molecules for `rmsdwrapper` (or maybe just accumulate the time that `rmsdwrapper` takes, without the loop).
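For example, the serial timing could be accumulated around the `rmsdwrapper` calls only, along these lines (a sketch; the molecule lists are assumed to be loaded already):

```python
import time

from spyrmsd.rmsd import rmsdwrapper


def timed_serial_rmsd(mols_ref, mols):
    """Accumulate only the time spent inside rmsdwrapper, excluding I/O and list prep."""
    elapsed = 0.0
    results = []
    for ref, mol in zip(mols_ref, mols):
        start = time.perf_counter()
        results.extend(rmsdwrapper(ref, mol))
        elapsed += time.perf_counter() - start
    return results, elapsed
```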
I would love to have a clean and efficient prmsdwrapper function merged, but this definitely isn't it. I also have many other projects coming up, so unfortunately I think that I will have less time to allocate for this as well. What do you think?
Maybe leaving out the timeout functionality for now (which I know is needed) would allow for a more efficient `prmsdwrapper`? I'm not sure though; as usual in these cases it needs to be implemented and benchmarked. I totally understand if you have no time to work on this. This PR added a lot of pieces to the puzzle anyway, so it has been extremely useful to get things started, even if you decide to pause it (in which case we can convert it to a draft) or simply close it for now. Up to you. I also have little time to work on this at the moment, I'm sorry about that.
In the meantime I'll try to make a release soon without this PR, so your other feature #107 is available to everyone.
Hi @RMeli
What I will try to do when I find some time is to implement `prmsdwrapper` with `pebble`, and try to benchmark that as well. If the performance is (a lot) better, we could move forward with it and get it implemented (which would require the optional `pebble` dependency). However, if the performance still isn't good, I would suggest pausing this idea for now and turning it into a draft, just like you suggested. What do you think?
@Jnelen sounds good to me, thanks for all the work!
I finally got around to integrating `pebble` into a multicore function. I kept the "old" version for now, but if you agree, I will delete it and just keep the `pebble` implementation. I currently named it `pebblermsdwrapper`, but I will rename it to `prmsdwrapper` after removing the original implementation (if you agree). The pebble version is a LOT faster, and also just a lot simpler.
To benchmark it, I used the GCR library from DUD-E, which contains around 15k compounds. This time I only time the RMSD calculations themselves and exclude the loading of the molecules (`io.loadmol`).
- Regular `rmsdwrapper`: 12.00s
- `pebblermsdwrapper` with `num_worker=1`: 14.27s
- `pebblermsdwrapper` with `num_worker=2`: 7.69s
- `pebblermsdwrapper` with `num_worker=4`: 4.93s
So when using one core, there is of course some overhead, but I find this more than reasonable in this case. Also, this has the timeout feature, so it can be used to handle "edge cases" as well. Finally, there is also a "chunksize" parameter, which means multiple molecules are processed per process. Currently I have kept this in, but I'm not sure if it should be present in the final version. It can give a speedup, but if a chunk times out or gives an error, all the molecules from that chunk/batch that still needed to be processed will fail. The standard probably should be 1, but I guess at least giving the option to configure it is good? Here are the same numbers with a chunksize of 10 for reference:
- `pebblermsdwrapper` with `num_worker=1, chunksize=10`: 12.87s
- `pebblermsdwrapper` with `num_worker=2, chunksize=10`: 6.70s
- `pebblermsdwrapper` with `num_worker=4, chunksize=10`: 3.73s
Overall, I think using `pebble` for this is the way forward. Pebble itself also doesn't require any dependencies and is really lightweight. I would love to hear your feedback/thoughts. If you agree, I will take a second look at the code and make some cleanups.
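For concreteness, a condensed sketch of how a pebble-based wrapper can be structured (names and defaults are illustrative, not necessarily the PR's code); note that with `chunksize > 1` the timeout and error handling are per chunk, which is exactly the caveat mentioned above:

```python
from concurrent.futures import TimeoutError

from pebble import ProcessPool

from spyrmsd.rmsd import rmsdwrapper


def _rmsd_pair(pair):
    """Compute the RMSD for a single (reference, molecule) pair."""
    ref, mol = pair
    return rmsdwrapper(ref, mol)[0]


def prmsdwrapper_sketch(molrefs, mols, num_workers=1, timeout=None, chunksize=1):
    """Pairwise RMSDs in parallel with a per-task timeout; NaN marks failures."""
    results = []
    with ProcessPool(max_workers=num_workers) as pool:
        future = pool.map(
            _rmsd_pair, zip(molrefs, mols), chunksize=chunksize, timeout=timeout
        )
        iterator = future.result()
        while True:
            try:
                results.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError:
                results.append(float("nan"))  # this task exceeded the timeout
            except Exception:
                results.append(float("nan"))  # the worker raised or died
    return results
```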
The pebble version is a LOT faster, and also just a lot simpler.
Do I need to read further? =)
This time I only time the RMSD calculations themselves and exclude the loading of the molecules (`io.loadmol`).
Thanks! Would you mind sharing those benchmarks as well, just for completeness?
So when using one core, there is of course some overhead, but I find this more than reasonable in this case. Also, this has the timeout feature, so it can be used to handle "edge cases" as well.
Yes, a bit of overhead is expected and I too think it looks reasonable in this case. And if the timeout feature comes "for free", that's even better.
The standard probably should be 1, but I guess giving at least the option to configure it is good?
Yes, a reasonable default with the option to configure it for power users is great. I'd totally keep the "chunk" option.
Overall, I think using pebble for this is the way forward.
I agree, thanks for looking into it! The project has 500+ stars on GitHub and it's currently under development, so I think it's definitely a reasonable inclusion (especially in light of your investigation!).
I'll have a proper look at the code later or in the coming days, but everything sounds great! Thanks a lot for the investigation and thorough report. The only thing I'd say is to keep `pebble` as an optional dependency anyway, since it's not critical for the functionality of `spyrmsd`, but a great addition. Given this, I'd maybe move everything related to this into a `spyrmsd.parallel` module, so that one would do
from spyrmsd.parallel import prmsdwrapper
The optional dependency can be done at the module level with something like the following:
try:
    import pebble
except ImportError:
    errmsg = (
        "Parallel execution of SPyRMSD (`prmsdwrapper`) requires `pebble`. "
        "Please install `pebble`: https://github.com/noxdafox/pebble"
    )
    raise ImportError(errmsg)
PS: Thanks to your PR on backend selection I'm simplifying CI, and I realised that tests/molecules.py is really bad (it led to bad design choices) and problematic when used with `@pytest.mark.parametrize` and multiple backends. Depending on which PR gets in first, one or the other will need some tweaking of the tests. If this is too annoying for you, I can hold off the other PR until this one is done (although CI would be much faster on this PR with the other one in). Just let me know.
Cycling CI for https://github.com/RMeli/spyrmsd/issues/119
Thanks! Would you mind sharing those benchmarks as well, just for completeness?
Oh yes of course! The dataset I used can be found here: https://zenodo.org/records/10257101/files/DUDE_GCR_3bqd.zip?download=1, where I used the Gnina (VS_GN_...) and LeadFinder (VS_LF_...) results.
The scripts I used can be found here: benchmark_scripts.zip
Moving everything related to this into a `spyrmsd.parallel` module, with `pebble` as an optional dependency, sounds like a good idea. I'll try to implement it like that later today. I guess that with that we can even leave rmsd.py completely untouched in this PR!
As for reviewing, only do it when you have the time, there is no rush!
PS: Thanks to your PR on backend selection I'm simplifying CI, and I realised that tests/molecules.py is really bad (it led to bad design choices) and problematic when used with `@pytest.mark.parametrize` and multiple backends. Depending on which PR gets in first, one or the other will need some tweaking of the tests. If this is too annoying for you, I can hold off the other PR until this one is done (although CI would be much faster on this PR with the other one in). Just let me know.
Really glad to hear that! You can go ahead; we can fix this up later. Especially since with the pebble implementation this code is really lightweight anyway!
I'll try to implement it like that later today. I guess that with that we can even leave rmsd.py completely untouched in this PR!
Sounds great. The tests can then be moved to `test_parallel.py` so that everything is orthogonal, which is a great design IMO.
I'll do a proper review when you do the move and remove the old code, but from a first look it looks good.
You can go ahead, we can fix up this later.
Ok, thanks. I think I'll merge soon, then leave a few comments here about the steps needed to make this PR work with the refactoring (it should only be a couple of things after all).
I just pushed the commit where I reverted rmsd.py and test_rmsd.py. I reverted these files to previous commits, so it should be fine. I added a separate parallel.py and test_parallel.py. One thing I'm not too sure about is the "reporting"/printing in prmsdwrapper, especially when the chunks are larger than 1. Also, I'm not sure if there should be more tests? Finally, I suppose I need to merge with the latest main branch now to pick up the new changes for the tests?
I added a separate parallel.py and test_parallel.py.
Perfect, thanks!
One thing I'm not too sure about is the "reporting"/printing with prmsdwrapper, especially when the chunks are larger than 1.
I've seen your comment in the code. I'm not sure either; I will think a bit more about it. However, I would use warnings or errors instead of `print()`.
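For instance, the per-run report could be collapsed into a single warning (a hypothetical helper, just to illustrate the idea):

```python
import warnings


def warn_failed_molecules(failed_indices, reason="timeout"):
    """Emit one compact warning instead of printing per molecule."""
    if failed_indices:
        warnings.warn(
            f"prmsdwrapper: {len(failed_indices)} molecule(s) failed ({reason}); "
            f"NaN returned at indices {sorted(failed_indices)}.",
            UserWarning,
        )
```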
Also I'm not sure if there should be more tests?
Let's see what the coverage says. If everything is covered, there should be no need. Or maybe it could be added to the long tests, but not sure it's worth it.
Also, I suppose I need to merge now with the latest main branch to make the new changes for the tests?
Yes, with `develop`, not `master` (which is only for releases). I left a few comments above on your tests about the changes needed to have them working with #118. Since your PR is orthogonal, it should be reasonably straightforward (I hope).
I wanted to merge the latest `develop` branch into this branch via GitHub, but it's complaining about conflicting files (tests/molecules.py). However, I can't resolve the conflict because I don't have write access. Can you do it? After the latest `develop` branch is merged, I will try to implement the changes from your latest review (probably tomorrow/later this week)!
@Jnelen usually you should be able to do it, since it would be modifying your own branch/fork and not the main repo. It turns out I can't do it either from the web interface, and I think it's because the `tests/molecules.py` file has been deleted. You need to do it locally using `git`'s CLI.
Yeah I just realized that now as well. I will try to do it this evening!
I merged the latest branch and updated the tests.
Honestly, I took the updated tests from test_rmsd.py that use `rmsdwrapper` and made them use `prmsdwrapper` instead.
I also added a `pytest.importorskip` in case Pebble isn't installed. However, maybe it'd be a good idea to include the installation of Pebble in the CI? If you agree, I can add it to the conda environment YAMLs (spyrmsd-test-....yaml)?
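For reference, the module-level guard in the tests would look roughly like this (a sketch; it assumes the new `spyrmsd.parallel` module):

```python
import pytest

# Skip the whole test_parallel.py module when the optional dependency is missing.
pebble = pytest.importorskip("pebble")

from spyrmsd.parallel import prmsdwrapper  # noqa: E402
```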
Thanks @Jnelen! I'll review the main function ASAP, sorry for the delay.
However, maybe it'd be a good idea to include the installation of Pebble in the CI?
Yes, it needs to be tested in CI. Since it's an optional dependency, we need CI configurations both with and without it. But since your design is completely orthogonal, I don't think we need to add too many. I'd suggest using `pebble` in the RDKit configurations and not in the OpenBabel configurations, so that we don't expand the CI matrix. What do you think? The files to add it to would be `spyrmsd-test-rdkit-all.yaml` and `spyrmsd-test-rdkit-nogt.yaml`.
Thanks @Jnelen! I'll review the main function ASAP, sorry for the delay.
There is no rush at all! Please only do it when you have the time!
Perfect, I just added it to those two environment files!
I just pushed an update where I tried to address most of your comments. I think the main things left are fixing the (potential) issues with chunksize > 1 and adding more tests for prmsdwrapper (especially regarding chunksize). I also opened an issue in the pebble repository to see if the author can give some input (there was no discussions page), since it's quite confusing. In the absolute worst-case scenario where we can't make it work properly, we can restrict the chunksize, I suppose, and make it flexible once we find a (proper) fix.
Alright, I made some minor changes (I added a check as you suggested) and also changed the printing to a more compact warning. I'm thinking of maybe adding a test where the chunksize is tested (with and without a timeout)? What other concrete changes should be made, in your opinion, for our redefined scope?
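A possible shape for such a test, heavily hedged since the final `prmsdwrapper` signature may differ (the parameter names and the molecule fixture here are assumptions, not the PR's actual test code):

```python
import math

import pytest

pebble = pytest.importorskip("pebble")

from spyrmsd.parallel import prmsdwrapper  # noqa: E402


@pytest.mark.parametrize("chunksize", [1, 10])
def test_prmsdwrapper_chunksize_timeout(molecule_pair, chunksize):
    """With an impossibly small timeout, every entry should come back as NaN."""
    ref, mol = molecule_pair  # hypothetical fixture providing a reference/target pair
    n = 5
    results = prmsdwrapper([ref] * n, [mol] * n, timeout=1e-6, chunksize=chunksize)
    assert len(results) == n
    assert all(math.isnan(r) for r in results)
```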