
test larger cpu runner

Open mikemhenry opened this issue 8 months ago • 21 comments

Checklist

  • [ ] Added a news entry

Developer Certificate of Origin

mikemhenry avatar Feb 28 '25 21:02 mikemhenry

Codecov Report

:x: Patch coverage is 10.00000% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 92.51%. Comparing base (192b582) to head (8a8b718).
:warning: Report is 239 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| openfe/tests/protocols/conftest.py | 11.11% | 16 Missing :warning: |
| ...enfe/tests/protocols/openmm_ahfe/test_ahfe_slow.py | 0.00% | 1 Missing :warning: |
| ...tests/protocols/openmm_rfe/test_hybrid_top_slow.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
- Coverage   94.66%   92.51%   -2.16%     
==========================================
  Files         143      143              
  Lines       10994    11012      +18     
==========================================
- Hits        10408    10188     -220     
- Misses        586      824     +238     
| Flag | Coverage Δ |
|---|---|
| fast-tests | 92.51% <10.00%> (?) |
| slow-tests | ? |

Flags with carried forward coverage won't be shown.

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Feb 28 '25 21:02 codecov[bot]

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

Good to know! Re-running now

mikemhenry avatar Feb 28 '25 22:02 mikemhenry

Running here: https://github.com/OpenFreeEnergy/openfe/actions/runs/13597612702

mikemhenry avatar Feb 28 '25 22:02 mikemhenry

large worked but timed out after 12 hours (which we can set up to 1 week) -- I will try non-integration tests since AFAIK that is what @IAlibay is trying to run -- just the slow tests.

mikemhenry avatar Mar 05 '25 16:03 mikemhenry

large worked but timed out after 12 hours (which we can set up to 1 week) -- I will try non-integration tests since AFAIK that is what @IAlibay is trying to run -- just the slow tests.

Yeah, running the "integration" tests is probably overkill without a GPU.

IAlibay avatar Mar 05 '25 17:03 IAlibay

large:

============================= slowest 10 durations =============================
2655.53s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-1-False-11-1-3]
2496.48s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2480.21s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2453.59s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2337.46s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2298.25s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2239.40s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2214.30s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2173.35s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2111.31s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1913 warnings, 3 rerun in 24749.25s (6:52:29) =
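
Side note on the test_probe_system_smoke_test failure above: on a CPU-only box nvidia-smi exits non-zero, so the probe may want to degrade gracefully instead of raising. A minimal sketch of that idea (hypothetical helper, not openfe's actual system-probe code):

```python
import subprocess

def gpu_info_or_none():
    """Return raw nvidia-smi CSV output, or None on a CPU-only machine.

    Hypothetical helper: a missing driver/GPU (non-zero exit status,
    as in the runner log above) is treated as "no GPU" instead of raising.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=gpu_name,driver_version,memory.total",
        "--format=csv",
    ]
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
```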

mikemhenry avatar Mar 10 '25 20:03 mikemhenry

xlarge

============================= slowest 10 durations =============================
2509.67s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2237.81s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2151.15s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
1884.45s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
1808.82s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
1451.05s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
1449.02s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
1399.31s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_many_molecules_solvent
1388.60s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
1313.94s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
=========================== short test summary info ============================
FAILED openfe/tests/utils/test_system_probe.py::test_probe_system_smoke_test - subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91
= 2 failed, 912 passed, 31 skipped, 2 xfailed, 3 xpassed, 1978 warnings, 3 rerun in 11132.77s (3:05:32) =

mikemhenry avatar Mar 12 '25 23:03 mikemhenry

Better than a 2x improvement: 24749 s (6:52:29) on large vs 11133 s (3:05:32) on xlarge, roughly 2.2x faster.

mikemhenry avatar Mar 12 '25 23:03 mikemhenry

Last check: going to see if the Intel flavor is any faster.

mikemhenry avatar Mar 12 '25 23:03 mikemhenry

@mikemhenry what flags are you using for these CPU runners? --runslow, or --integration too? 3 h seems way too long for just the slow tests.

IAlibay avatar Mar 13 '25 00:03 IAlibay

--integration as well -- I wanted to get some benchmarking data on the integration tests without a GPU.
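
For context, opt-in slow/integration tiers are usually wired up in conftest.py roughly like the sketch below. This is a generic pattern, not necessarily openfe's exact conftest; the OFE_SLOW_TESTS handling here is an assumption:

```python
# Generic sketch of opt-in test tiers via pytest flags / an env var.
import os
import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", help="run slow tests")
    parser.addoption("--integration", action="store_true", help="run integration tests")

def pytest_collection_modifyitems(config, items):
    run_slow = (
        config.getoption("--runslow")
        or os.environ.get("OFE_SLOW_TESTS", "false").lower() == "true"
    )
    run_integration = config.getoption("--integration")
    skip_slow = pytest.mark.skip(reason="needs --runslow or OFE_SLOW_TESTS=true")
    skip_integration = pytest.mark.skip(reason="needs --integration")
    for item in items:
        if "slow" in item.keywords and not run_slow:
            item.add_marker(skip_slow)
        if "integration" in item.keywords and not run_integration:
            item.add_marker(skip_integration)
```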

mikemhenry avatar Mar 13 '25 20:03 mikemhenry

I actually turned off integration tests back in https://github.com/OpenFreeEnergy/openfe/pull/1170/commits/98cec71d28a0bed61d3ffbe433447ad0c66d31d6

mikemhenry avatar Mar 13 '25 22:03 mikemhenry

But you're right, that is kinda slow for just the slow tests.

mikemhenry avatar Mar 13 '25 23:03 mikemhenry

Now the runners are running out of disk space when installing the env; I need to check if there are new deps making the env bigger or if something else is going on. I can also increase the EBS image size.

mikemhenry avatar Mar 14 '25 14:03 mikemhenry

Testing here: https://github.com/OpenFreeEnergy/openfe/actions/runs/14044852203/job/39323509147

mikemhenry avatar Mar 24 '25 21:03 mikemhenry

Sweet, getting:

FAILED openfe/tests/protocols/test_openmm_rfe_slow.py::test_openmm_run_engine[CUDA] - openmm.OpenMMException: Error initializing CUDA: CUDA_ERROR_NO_DEVICE (100) at /home/conda/feedstock_root/build_artifacts/openmm_1726255919104/work/platforms/cuda/src/CudaContext.cpp:91

But we expect that to fail. I am not sure why we are running this test at all, since we only have OFE_SLOW_TESTS: "true" and no integration tests turned on, and it carries the @pytest.mark.integration mark.
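
One quick way to see which collected tests still carry the integration mark but aren't being skipped -- a throwaway diagnostic for conftest.py, purely illustrative:

```python
# Throwaway diagnostic: after collection, report any item that has the
# integration marker but no skip marker, i.e. it will actually run.
def pytest_collection_finish(session):
    for item in session.items:
        has_integration = item.get_closest_marker("integration") is not None
        will_skip = any(m.name in ("skip", "skipif") for m in item.iter_markers())
        if has_integration and not will_skip:
            print(f"integration test not skipped: {item.nodeid}")
```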

mikemhenry avatar Mar 24 '25 23:03 mikemhenry

Timing info, btw: = 5 failed, 936 passed, 28 skipped, 2 xfailed, 3 xpassed, 2010 warnings, 3 rerun in 11167.03s (3:06:07) =

mikemhenry avatar Mar 24 '25 23:03 mikemhenry

I want to keep this PR open since something isn't quite right -- we seem to be running more than just the slow tests.

mikemhenry avatar Mar 25 '25 14:03 mikemhenry

Okay, let's try this again now that we have fixed the round-trip stuff.

mikemhenry avatar Mar 25 '25 19:03 mikemhenry

lol made you look

mikemhenry avatar Mar 25 '25 19:03 mikemhenry

Almost have all the timing data I need, just need to add a skip on the GPU test
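
For the skip, something along these lines should work. This is an illustrative sketch (requires_cuda is a hypothetical name): asking OpenMM for the CUDA platform by name can succeed on a CPU-only box if the plugin is installed, so the check actually tries to build a context, which is where CUDA_ERROR_NO_DEVICE shows up.

```python
import pytest

def _cuda_usable() -> bool:
    """Try to build a trivial CUDA context; no device -> CUDA_ERROR_NO_DEVICE."""
    try:
        import openmm
        system = openmm.System()
        system.addParticle(1.0)
        integrator = openmm.VerletIntegrator(0.001)
        openmm.Context(system, integrator, openmm.Platform.getPlatformByName("CUDA"))
    except Exception:
        return False
    return True

# Hypothetical marker for test_openmm_run_engine[CUDA]; the real fix may differ.
requires_cuda = pytest.mark.skipif(
    not _cuda_usable(), reason="no usable CUDA device on this runner"
)
```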

mikemhenry avatar May 29 '25 14:05 mikemhenry

| AWS Instance Name | Cost ($/hr) | Test Duration | Test Cost |
|---|---|---|---|
| t3a.2xlarge | 0.3008 | 3h 2m 19s | $0.91 |
| t3a.xlarge | 0.1504 | 5h 1m 29s | $0.76 |
| t3.xlarge | 0.1664 | 4h 51m 34s | $0.81 |
| t3.2xlarge | 0.3328 | 3h 47m 22s | $1.26 |
| t3a.large | 0.0752 | 5h 36m 9s | $0.42 |
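
As a sanity check on the Test Cost column, it's just the hourly rate times the duration in hours (quick arithmetic sketch):

```python
# Test Cost = hourly rate * duration in hours (numbers from the table above).
instances = {
    "t3a.2xlarge": (0.3008, 3 + 2 / 60 + 19 / 3600),
    "t3a.xlarge": (0.1504, 5 + 1 / 60 + 29 / 3600),
    "t3.xlarge": (0.1664, 4 + 51 / 60 + 34 / 3600),
    "t3.2xlarge": (0.3328, 3 + 47 / 60 + 22 / 3600),
    "t3a.large": (0.0752, 5 + 36 / 60 + 9 / 3600),
}
for name, (rate, hours) in instances.items():
    print(f"{name}: ${rate * hours:.2f}")  # e.g. t3a.large -> $0.42
```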

mikemhenry avatar May 29 '25 22:05 mikemhenry

@IAlibay @atravitz I think we should go with the t3a.large option since it is the cheapest; I don't really care if it takes the longest -- thoughts? See this unsorted table https://github.com/OpenFreeEnergy/openfe/pull/1170#issuecomment-2920722067

mikemhenry avatar May 30 '25 16:05 mikemhenry

What is the expected use-case and frequency of this runner? i.e., do you see this being used in our CI, or kept as manual-trigger only?

atravitz avatar May 30 '25 17:05 atravitz

I'm so confused as to why these are taking so long on AWS. I can run the long tests on my workstation on the order of minutes.

Are we including --integration in this? For CPU runners it might be best if we don't, and just keep that for the GPU runners.

IAlibay avatar May 30 '25 17:05 IAlibay

Does your workstation have a GPU? How long does it take if you add CUDA_VISIBLE_DEVICES="" before the pytest command? My guess is we have some tests that are really integration tests but are marked slow. I will test locally.
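
For anyone reproducing the CPU-only runner locally, hiding the GPUs before OpenMM loads is enough to mimic it (illustrative sketch):

```python
import os

# Hide all GPUs to mimic the CPU-only CI runners; must happen before any
# CUDA context is created (same effect as the CUDA_VISIBLE_DEVICES="" prefix).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import openmm

# The CUDA platform can still be listed if the plugin is installed; it is
# Context creation that fails with CUDA_ERROR_NO_DEVICE, as in the logs above.
for i in range(openmm.Platform.getNumPlatforms()):
    print(openmm.Platform.getPlatform(i).getName())
```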

mikemhenry avatar May 30 '25 17:05 mikemhenry

Oh I have an idea why!!!

mikemhenry avatar May 30 '25 17:05 mikemhenry

@IAlibay how do you invoke the tests? I'm running $ CUDA_VISIBLE_DEVICES="" pytest -n 2 -vv --durations=10 --runslow openfecli/tests/ openfe/tests/ and it is taking a lot longer than a few minutes on my laptop.

mikemhenry avatar May 30 '25 18:05 mikemhenry

@IAlibay how do you invoke the tests? I'm running $ CUDA_VISIBLE_DEVICES="" pytest -n 2 -vv --durations=10 --runslow openfecli/tests/ openfe/tests/ and it is taking a lot longer than a few minutes on my laptop.

Testing right now with the CUDA_VISIBLE_DEVICES being set.

IAlibay avatar May 30 '25 18:05 IAlibay

@mikemhenry it runs in 35 mins for me

IAlibay avatar May 30 '25 19:05 IAlibay