test larger cpu runner
Checklist
- [ ] Added a news entry
Developer Certificate of Origin
- [ ] I certify that this contribution is covered by the MIT License here and the Developer Certificate of Origin at https://developercertificate.org/.
Codecov Report
:x: Patch coverage is 10.00000% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 92.51%. Comparing base (192b582) to head (8a8b718).
:warning: Report is 239 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1170 +/- ##
==========================================
- Coverage 94.66% 92.51% -2.16%
==========================================
Files 143 143
Lines 10994 11012 +18
==========================================
- Hits 10408 10188 -220
- Misses 586 824 +238
| Flag | Coverage Δ | |
|---|---|---|
| fast-tests | 92.51% <10.00%> (?) | |
| slow-tests | ? | |
Flags with carried forward coverage won't be shown.
"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
Good to know! Re-running now
Running here: https://github.com/OpenFreeEnergy/openfe/actions/runs/13597612702
large worked but timed out after 12 hours (which we can set up to 1 week) -- I will try non-integration tests since AFAIK that is what @IAlibay is trying to run -- just the slow tests.
Yeah, running the "integration" tests is probably overkill without a GPU.
large:
============================= slowest 10 durations =============================
2655.53s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3]
2496.48s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2480.21s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2453.59s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2337.46s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2298.25s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2239.40s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2214.30s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2173.35s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2111.31s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1913 warnings, 3 rerun in 24749.25s (6:52:29) =
xlarge:
============================= slowest 10 durations =============================
2509.67s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2237.81s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2151.15s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
1884.45s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
1808.82s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
1451.05s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
1449.02s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
1399.31s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_many_molecules_solvent
1388.60s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
1313.94s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1978 warnings, 3 rerun in 11132.77s (3:05:32) =
Better than a 2x improvement (6:52:29 on large vs 3:05:32 on xlarge, roughly 2.2x).
Last check, going to see if the intel flavor is any faster
@mikemhenry what flags are you using for these CPU runners? --runslow, or --integration too? 3h seems way too long for just the slow tests.
integration as well -- I wanted to get some benchmarking data on the integration tests without a GPU
I actually turned off integration tests back in https://github.com/OpenFreeEnergy/openfe/pull/1170/commits/98cec71d28a0bed61d3ffbe433447ad0c66d31d6
But you're right, that is kinda slow for just the slow tests.
Now the runners are running out of disk space when installing the env, need to check if there are new deps making the env bigger or something else going on. I can also increase the EBS image size.
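As a quick sanity check on the runner (a sketch only -- the env path below is a guess, adjust it to wherever the CI job actually creates the environment):

```python
# Illustrative only: report overall disk usage plus the size of a given directory tree,
# e.g. the conda/mamba env the CI job creates. The env path is a hypothetical example.
import os
import shutil

total, used, free = shutil.disk_usage("/")
print(f"disk: {used / 1e9:.1f} GB used of {total / 1e9:.1f} GB ({free / 1e9:.1f} GB free)")

env_path = os.path.expanduser("~/micromamba/envs/openfe")  # hypothetical location
size = sum(
    os.path.getsize(os.path.join(dirpath, name))
    for dirpath, _, files in os.walk(env_path)
    for name in files
    if os.path.isfile(os.path.join(dirpath, name))
)
print(f"{env_path}: {size / 1e9:.1f} GB")
```

That would show whether it is the env itself that grew or something else eating the disk before bumping the EBS size.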
testing here https://github.com/OpenFreeEnergy/openfe/actions/runs/14044852203/job/39323509147
Sweet, getting:
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
But we expect that to fail. I am not sure why we are running this test at all: we only have OFE_SLOW_TESTS: "true" set and no integration tests turned on, and it has an @pytest.mark.integration mark.
timing info btw
= 5 failed, 936 passed, 28 skipped, 2 xfailed, 3 xpassed, 2010 warnings, 3 rerun in 11167.03s (3:06:07) =
I want to keep this PR open since something isn't quite right -- we seem to be running more than just the slow tests.
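For reference, this is roughly how slow/integration gating is usually wired up in a pytest conftest.py -- a minimal sketch, not openfe's actual conftest; the flag names and the OFE_SLOW_TESTS env var are taken from this thread and the exact behaviour in openfe may differ:

```python
# Sketch of a typical conftest.py gating scheme: slow tests run only with --runslow
# (or an env-var override), integration tests only with --integration.
import os
import pytest


def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
                     help="run tests marked as slow")
    parser.addoption("--integration", action="store_true", default=False,
                     help="run tests marked as integration")


def pytest_collection_modifyitems(config, items):
    run_slow = (config.getoption("--runslow")
                or os.environ.get("OFE_SLOW_TESTS") == "true")
    run_integration = config.getoption("--integration")

    skip_slow = pytest.mark.skip(reason="need --runslow (or OFE_SLOW_TESTS=true) to run")
    skip_integration = pytest.mark.skip(reason="need --integration to run")

    for item in items:
        if "slow" in item.keywords and not run_slow:
            item.add_marker(skip_slow)
        if "integration" in item.keywords and not run_integration:
            item.add_marker(skip_integration)
```

If the real gating ever treats OFE_SLOW_TESTS as enabling everything, or the integration check is missing for some paths, that would explain integration-marked tests sneaking into a slow-only run.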
Okay, let's try this again now that we have fixed the round-trip stuff.
lol made you look
Almost have all the timing data I need, just need to add a skip on the GPU test
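Something along these lines should do it -- a sketch only; the helper is hypothetical and the test name is a stand-in for test_openmm_run_engine[CUDA]:

```python
# Sketch: skip the CUDA test when OpenMM cannot actually build a CUDA context
# (covers both a missing CUDA platform and CUDA_ERROR_NO_DEVICE on CPU-only runners).
import pytest
import openmm


def _cuda_available() -> bool:
    try:
        platform = openmm.Platform.getPlatformByName("CUDA")
        system = openmm.System()
        system.addParticle(1.0)  # a context needs at least one particle
        openmm.Context(system, openmm.VerletIntegrator(0.001), platform)
        return True
    except Exception:
        return False


@pytest.mark.skipif(not _cuda_available(), reason="requires a CUDA-capable GPU")
def test_openmm_run_engine_cuda():
    ...
```

With CUDA_VISIBLE_DEVICES="" the context creation fails with the CUDA_ERROR_NO_DEVICE seen above, so the test gets skipped instead of failing.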
| AWS Instance Name | Cost ($/hr) | Test Duration | Test Cost |
|---|---|---|---|
| t3a.2xlarge | 0.3008 | 3h 2m 19s | $ 0.91 |
| t3a.xlarge | 0.1504 | 5h 1m 29s | $ 0.76 |
| t3.xlarge | 0.1664 | 4h 51m 34s | $ 0.81 |
| t3.2xlarge | 0.3328 | 3h 47m 22s | $ 1.26 |
| t3a.large | 0.0752 | 5h 36m 9s | $ 0.42 |
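The cost column is just the hourly rate times the wall-clock duration; a quick check that reproduces the numbers above:

```python
# Verify the Test Cost column: on-demand $/hr x wall-clock hours.
hourly = {
    "t3a.2xlarge": 0.3008,
    "t3a.xlarge": 0.1504,
    "t3.xlarge": 0.1664,
    "t3.2xlarge": 0.3328,
    "t3a.large": 0.0752,
}
durations_hms = {
    "t3a.2xlarge": (3, 2, 19),
    "t3a.xlarge": (5, 1, 29),
    "t3.xlarge": (4, 51, 34),
    "t3.2xlarge": (3, 47, 22),
    "t3a.large": (5, 36, 9),
}
for name, (h, m, s) in durations_hms.items():
    hours = h + m / 60 + s / 3600
    print(f"{name}: ${hourly[name] * hours:.2f}")
# t3a.2xlarge: $0.91, t3a.xlarge: $0.76, t3.xlarge: $0.81, t3.2xlarge: $1.26, t3a.large: $0.42
```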
@IAlibay @atravitz I think we should go with the t3a.large option since it is the cheapest; I don't really care if it takes the longest -- thoughts? See this unsorted table https://github.com/OpenFreeEnergy/openfe/pull/1170#issuecomment-2920722067
What is the expected use-case and frequency of this runner? i.e., do you see this being used in our CI, or kept as manual-trigger only?
I'm so confused as to why these are taking so long on AWS. I can run the long tests on my workstation in a matter of minutes.
Are we including --integration in this? For CPU runners it might be best we don't and just keep that for GPU runners?
Does your workstation have a GPU? How long does it take if you add CUDA_VISIBLE_DEVICES="" before the pytest command? My guess is we have some tests that are integration but marked slow. I will test locally.
Oh I have an idea why!!!
@IAlibay how do you invoke the tests?
$ CUDA_VISIBLE_DEVICES="" pytest -n 2 -vv --durations=10 --runslow openfecli/tests/ openfe/tests/
This is taking more than just minutes on my laptop.
Testing right now with the CUDA_VISIBLE_DEVICES being set.
@mikemhenry it runs in 35 mins for me