Some AWS CI updates
Checklist
- [ ] Added a news entry
Developer Certificate of Origin
- [ ] I certify that this contribution is covered by the MIT License here and the Developer Certificate of Origin at https://developercertificate.org/.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.57%. Comparing base (a9d8fb5) to head (aaa83eb). Report is 127 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #910 +/- ##
=======================================
Coverage 92.57% 92.57%
=======================================
Files 134 134
Lines 9917 9917
=======================================
Hits 9181 9181
Misses 736 736
| Flag | Coverage Δ |
|---|---|
| fast-tests | 92.57% <ø> (ø) |
Flags with carried forward coverage won't be shown.
:umbrella: View full report in Codecov by Sentry.
the c4.xlarge seemed fine IMHO = 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39) = the tests ran super fast on it
> the c4.xlarge seemed fine IMHO = 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39) = the tests ran super fast on it
@mikemhenry I'm so confused, are we looking at the same thing?
I'm seeing the c4.xlarge jobs at > 5h on your branch (4h+ on tests)
Also - fun story, c4.xlarge is more expensive per hour than c6 or c7 🙈
@IAlibay it is this one https://github.com/OpenFreeEnergy/openfe/actions/runs/10143940924/job/28046616317#step:10:2709 which is the one from your PR where you changed to the faster instance -- I just got my wires crossed and copied and pasted what I thought was a c4 run with the CPU affinity set manually. And it looks like, as long as this instance isn't 4x the price (lol, it's not), then it saves us money spending more on the instance that takes less time.
I think at this point, if you got a GPU quota bump, it is time to test on those, since it looks like we have everything working CPU-wise, and unless we want to use AWS for CPU tests, I don't think it's worth optimizing this.
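For context, here is a minimal sketch of what "setting the CPU affinity manually" could look like on a Linux runner; the core IDs, the pytest-xdist worker count, and the test path are placeholders, not values taken from the actual workflow:

```python
# Hypothetical sketch (Linux only): pin this process and its children to a fixed
# set of cores before launching the test suite, so worker processes don't
# migrate between vCPUs. Core IDs and arguments below are made up for illustration.
import os
import subprocess

os.sched_setaffinity(0, {0, 1, 2, 3})  # restrict to cores 0-3 (pid 0 = this process)

# Child processes (e.g. pytest-xdist workers) inherit the affinity mask set above.
subprocess.run(["pytest", "-n", "4", "openfe/tests"], check=False)
```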
> then it saves us money spending more on the instance that takes less time.
@mikemhenry just to point out, I think the main win in the other PR is removing coverage, not the instance type (needs testing, but I've seen the same thing locally).
testing commit https://github.com/OpenFreeEnergy/openfe/pull/910/commits/036e08701ec007b0e122306b66d47187fc1a4ee5 here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10221938633
botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need) or you will need to do it yourself; I will see if we can run on the P family.
testing commit https://github.com/OpenFreeEnergy/openfe/pull/910/commits/4687c52552621ef2893c748a44ee6b73e60e4443 here https://github.com/OpenFreeEnergy/openfe/actions/runs/10221976998
botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Looks like we didn't end up getting our quota request approved for the P family either?
> okay @IAlibay looks like you need to either make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need) or you will need to do it, I will see if we can run on the P family
This is odd; I would double-check the quota amount on the G-instance types. I would advise against using the p2.xlarge, as it is not compatible with the Deep Learning AMI we recommend and you would therefore need a custom AMI (Eco Infra built one, but the cost-benefit did not seem worthwhile).
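For double-checking the quota, here is a hedged boto3 sketch; the quota code below is my guess at the "Running On-Demand G and VT instances" limit and should be verified in the Service Quotas console before relying on it:

```python
# Sketch: compare the G-family vCPU quota across the two regions in play.
# Assumes AWS credentials for the account are already configured.
import boto3

QUOTA_CODE = "L-DB2E81BA"  # assumed code for "Running On-Demand G and VT instances"

for region in ("us-east-1", "us-east-2"):
    sq = boto3.client("service-quotas", region_name=region)
    quota = sq.get_service_quota(ServiceCode="ec2", QuotaCode=QUOTA_CODE)
    print(f"{region}: {quota['Quota']['Value']:.0f} vCPUs")
```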
I don't have access to the AWS account, so I am just guessing and checking which GPU quotas we might have. And yes, we want to avoid using a custom AMI; in my experience they are easy to set up but hard to keep up to date.
@mikemhenry looks like all my quota increases were in us-east-2 not us-east-1, which would explain things.
okay giving this another shot here https://github.com/OpenFreeEnergy/openfe/actions/runs/10322623363
You may need to update the AMI since they are region-specific.
ah, at least it is a public AMI, so I should be able to fix it even though I don't have access
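Since the AMI is public, something like this boto3 sketch should find the matching image in us-east-1 by name; the name filter is a placeholder, and the real name would be copied from the existing us-east-2 image:

```python
# Sketch: find the equivalent public AMI in us-east-1 by (placeholder) name pattern.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning*"]}],  # placeholder pattern
)

# Newest images first; the top hit's ImageId would go into the us-east-1 workflow config.
for image in sorted(resp["Images"], key=lambda i: i["CreationDate"], reverse=True)[:3]:
    print(image["ImageId"], image["Name"])
```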
okay test here https://github.com/OpenFreeEnergy/openfe/actions/runs/10322863477
New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10324157429/job/28583139693
Didn't have the shell set up correctly
@IAlibay Do we want to do the integration tests as well? I was thinking of at least turning them on so we can get some timing data (after the current test finishes).
Sweet, it ran in 1h 27m 20s, so one minute longer than it took on a CPU-only runner, but at least now we are checking our GPU code paths for ~75 cents
I am going to run it again but this time do our integration tests as well to see how long they take
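As a sanity check that the GPU code paths really are being exercised on the new runner (a sketch, not something currently in the workflow), OpenMM can be asked which platforms it sees:

```python
# Sketch: fail fast if the runner's OpenMM build can't see a CUDA platform.
from openmm import Platform

names = [Platform.getPlatform(i).getName() for i in range(Platform.getNumPlatforms())]
print("Available OpenMM platforms:", names)
assert "CUDA" in names, "No CUDA platform available; GPU tests would silently fall back to CPU"
```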
longer tests running here https://github.com/OpenFreeEnergy/openfe/actions/runs/10326372977
The integration tests are really the ones that have any kind of GPU testing. In many ways we might want to only run those.
Yeah, I was thinking about whether it is worth running only the GPU tests or all of the tests. We might have to tweak our tests to support that option, and it would save us some money, but if it took an hour of work to do, how many CI runs would we need to break even?
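Rough back-of-the-envelope on that break-even question, with made-up numbers (the developer rate and savings fraction are pure guesses; only the per-run cost comes from the ~75-cent figure above):

```python
# Hypothetical break-even estimate for spending an hour trimming CI to GPU-only tests.
dev_cost = 1 * 100.0       # 1 hour of work at a guessed $100/hr
cost_per_run = 0.75        # ~cost per GPU run observed above
savings_fraction = 0.5     # guess: a GPU-only subset halves the runtime/cost
savings_per_run = cost_per_run * savings_fraction

print(f"Break-even after ~{dev_cost / savings_per_run:.0f} runs")  # ~267 runs
```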
Long tests ran in 1h 53m 39s
============================= slowest 10 durations =============================
1899.24s call openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
916.34s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
842.12s call openfe/tests/protocols/test_openmmutils.py::TestOFFPartialCharge::test_am1bcc_conformer_nochange
824.47s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
762.52s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
712.34s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
658.25s call openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
630.85s setup openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::TestTyk2XmlRegression::test_particles
555.15s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
537.59s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
= 861 passed, 37 skipped, 1 xfailed, 2 xpassed, 5557 warnings in 6553.91s (1:49:13) =
It looks like most of the skips were from missing openeye and espaloma; we should probably install those packages so we can run everything.
Thoughts, @IAlibay?
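For reference, these skips come from the usual optional-dependency guards, so installing openeye and espaloma in the runner env is what flips them back on. A sketch of the pattern (names are illustrative, not openfe's exact fixtures):

```python
# Illustrative pattern for optional-dependency skips (not openfe's exact code).
import importlib.util
import pytest

# Module-level guard: every test in this module is skipped if openeye is missing.
openeye = pytest.importorskip("openeye")

# Per-test guard for another optional package.
HAS_ESPALOMA = importlib.util.find_spec("espaloma") is not None

@pytest.mark.skipif(not HAS_ESPALOMA, reason="requires espaloma")
def test_partial_charges_with_espaloma():
    ...
```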
Added openeye and espaloma to the env, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390261526
typo in channel name, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390460059
Ran into this issue:
critical libmamba Multiple errors occured:
Download error (28) Timeout was reached [https://conda.anaconda.org/jaimergp/label/unsupported-cudatoolkit-shim/noarch/repodata.json.zst]
Operation too slow. Less than 30 bytes/sec transferred the last 60 seconds
Subdir jaimergp/label/unsupported-cudatoolkit-shim/noarch not loaded!
I removed the channel from our env.yaml; I will check to make sure it doesn't cause a large download in the other jobs.
New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390626657
some tests are skipped:
2024-08-14T16:20:29.4649694Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip
2024-08-14T16:20:29.4660788Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry
2024-08-14T16:20:29.4676526Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry
2024-08-14T16:20:29.4690414Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip
2024-08-14T16:20:29.4705426Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip
2024-08-14T16:20:29.4715536Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip
2024-08-14T16:20:29.4730663Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip
2024-08-14T16:20:29.4740311Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable
2024-08-14T16:20:29.4755475Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable
2024-08-14T16:20:29.4765931Z openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr
2024-08-14T16:20:29.4779914Z [gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr
Will have to look into that but it took like 2 hours to run
2024-08-14T17:46:31.3838640Z ============================= slowest 10 durations =============================
2024-08-14T17:46:31.3840434Z 2023.28s call openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
2024-08-14T17:46:31.3842885Z 883.92s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
2024-08-14T17:46:31.3845828Z 740.35s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
2024-08-14T17:46:31.3848191Z 696.44s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
2024-08-14T17:46:31.3850752Z 691.12s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
2024-08-14T17:46:31.3853070Z 681.27s call openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
2024-08-14T17:46:31.3855546Z 651.47s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
2024-08-14T17:46:31.3858084Z 577.98s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
2024-08-14T17:46:31.3860668Z 538.88s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
2024-08-14T17:46:31.3863590Z 531.53s call openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
2024-08-14T17:46:31.3865818Z = 872 passed, 24 skipped, 4 xfailed, 5 xpassed, 5718 warnings in 6914.35s (1:55:14) =
I think now we just need to decide how often we want to schedule this run and it will be good to merge!
@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run
> @IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run
I think we need to have a chat / re-organize things - we're running a ton of CPU-only tests on a GPU runner, which seems like a bit of a waste (especially if we're doing this regularly).
We probably want to move GPU-only tests to a specific set of test files and only run those (e.g. the rfe slow tests).
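One possible way to do that (a sketch, not how openfe is currently organized): register a `gpu` marker and tag the slow GPU-touching tests, so the AWS job can run `pytest -m gpu` while ordinary CI keeps running `pytest -m "not gpu"`:

```python
# conftest.py -- register a custom marker (sketch; openfe may already organize this differently)
def pytest_configure(config):
    config.addinivalue_line("markers", "gpu: tests that exercise CUDA code paths")
```

```python
# test_openmm_rfe_slow.py -- illustrative tagging of a GPU-heavy test
import pytest

@pytest.mark.gpu
def test_run_eg5_sim():
    ...
```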