
Some AWS CI updates

Open · IAlibay opened this issue 1 year ago • 28 comments

Checklist

  • [ ] Added a news entry

Developer Certificate of Origin

IAlibay avatar Jul 29 '24 12:07 IAlibay

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.57%. Comparing base (a9d8fb5) to head (aaa83eb). Report is 127 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #910   +/-   ##
=======================================
  Coverage   92.57%   92.57%           
=======================================
  Files         134      134           
  Lines        9917     9917           
=======================================
  Hits         9181     9181           
  Misses        736      736           
| Flag | Coverage Δ |
| --- | --- |
| fast-tests | 92.57% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.



codecov[bot] avatar Jul 29 '24 12:07 codecov[bot]

The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

mikemhenry avatar Jul 29 '24 15:07 mikemhenry

> The c4.xlarge seemed fine IMHO: 2 failed, 855 passed, 40 skipped, 1 xfailed, 2 xpassed, 3594 warnings, 3 rerun in 4059.65s (1:07:39). The tests ran super fast on it.

@mikemhenry I'm so confused, are we looking at the same thing?

I'm seeing the c4.xlarge jobs at > 5h on your branch (4h+ on tests)

(screenshot: CI job timings)

Also - fun story, c4.xlarge is more expensive per hour than c6 or c7 🙈

(screenshot: EC2 instance pricing)

IAlibay avatar Jul 30 '24 02:07 IAlibay

@IAlibay it is this one https://github.com/OpenFreeEnergy/openfe/actions/runs/10143940924/job/28046616317#step:10:2709 which is the run from your PR where you changed to the faster instance. I just got my wires crossed and copied and pasted what I thought was a c4 run with the CPU affinity set manually. It looks like, as long as this instance isn't 4x the price (lol, it's not), it saves us money to spend more on the instance that takes less time.

I think at this point, if you got a GPU quota bump, it is time to test on those, since it looks like we have everything working CPU-wise. Unless we want to use AWS for CPU tests, I don't think it's worth optimizing this further.

mikemhenry avatar Jul 31 '24 17:07 mikemhenry
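A quick back-of-the-envelope sketch of the cost trade-off discussed above; the prices and runtimes below are illustrative assumptions, not quoted AWS rates or measured times. The point is only that total cost per run is hourly price times wall-clock time, so a pricier instance that finishes much faster can still be cheaper per run.

```python
# Back-of-the-envelope cost comparison: cost per CI run = $/hour x hours.
# All numbers are illustrative placeholders, not real AWS prices or timings.
candidate_runs = {
    # instance type: (assumed $/hour, assumed runtime in hours)
    "c4.xlarge": (0.20, 5.0),
    "c6i.xlarge": (0.17, 1.5),
}

for instance, (price_per_hour, hours) in candidate_runs.items():
    print(f"{instance}: ${price_per_hour * hours:.2f} per run")
```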

> it saves us money to spend more on the instance that takes less time.

@mikemhenry just to point out, I think the main win in the other PR is removing coverage, not the instance type (needs testing, but I've seen the same thing locally).

IAlibay avatar Jul 31 '24 17:07 IAlibay

testing commit https://github.com/OpenFreeEnergy/openfe/pull/910/commits/036e08701ec007b0e122306b66d47187fc1a4ee5 here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10221938633

mikemhenry avatar Aug 02 '24 20:08 mikemhenry

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it. I will see if we can run on the P family.

mikemhenry avatar Aug 02 '24 20:08 mikemhenry
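For reference, the vCPU limit behind the VcpuLimitExceeded error can be inspected, and a bump requested, through the Service Quotas API. A minimal boto3 sketch; the quota code below is an assumption (believed to be the "Running On-Demand G and VT instances" bucket) and should be verified in the Service Quotas console before relying on it.

```python
import boto3

# Minimal sketch: check the current on-demand vCPU limit for the G/VT
# instance bucket in a given region. The quota code is an assumption --
# confirm it in the AWS Service Quotas console before use.
client = boto3.client("service-quotas", region_name="us-east-1")

quota = client.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-DB2E81BA",  # assumed quota code for G/VT on-demand vCPUs
)
print("Current G-family vCPU limit:", quota["Quota"]["Value"])

# Requesting an increase to 4 vCPUs (enough for one 4-vCPU GPU instance)
# would look roughly like:
# client.request_service_quota_increase(
#     ServiceCode="ec2", QuotaCode="L-DB2E81BA", DesiredValue=4,
# )
```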

testing commit https://github.com/OpenFreeEnergy/openfe/pull/910/commits/4687c52552621ef2893c748a44ee6b73e60e4443 here https://github.com/OpenFreeEnergy/openfe/actions/runs/10221976998

mikemhenry avatar Aug 02 '24 21:08 mikemhenry

botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

Looks like we didn't end up getting our quota request approved (or something) for the P family either?

mikemhenry avatar Aug 02 '24 21:08 mikemhenry

> botocore.exceptions.ClientError: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
>
> Okay @IAlibay, looks like you either need to make me an IAM account so I can request a quota increase for the G family of instances (4 vCPUs is all we need), or you will need to do it. I will see if we can run on the P family.

This is odd; I would double-check the quota amount on the G-instance types. I would advise against using the p2.xlarge, as they are not compatible with the Deep Learning AMI we recommend, so you would need a custom AMI (Eco Infra built one, but the cost-benefit did not seem worthwhile).

ethanholz avatar Aug 05 '24 14:08 ethanholz

I don't have access to the AWS account, so I am just doing what I can on the guess-and-check side of things in terms of which GPU quota we might have.

And yes, we want to avoid using a custom AMI; in my experience they are easy to set up but hard to keep up to date.

mikemhenry avatar Aug 05 '24 19:08 mikemhenry

@mikemhenry looks like all my quota increases were in us-east-2, not us-east-1, which would explain things.

IAlibay avatar Aug 07 '24 12:08 IAlibay

okay giving this another shot here https://github.com/OpenFreeEnergy/openfe/actions/runs/10322623363

mikemhenry avatar Aug 09 '24 16:08 mikemhenry

You may need to update the AMI, since they are region-specific.

ethanholz avatar Aug 09 '24 16:08 ethanholz
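Since AMI IDs are region-specific, finding the equivalent public image in the new region is usually a name lookup. A rough boto3 sketch; the name filter is an assumption and should be adjusted to whatever AMI family the runner config actually uses.

```python
import boto3

# Rough sketch: find the most recently created public Deep Learning AMI
# in the target region by name. The name filter below is an assumed
# pattern, not necessarily the one the runner config uses.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning Base*Ubuntu*"]}],
)

# Pick the newest match (assumes at least one image matched the filter).
latest = max(response["Images"], key=lambda image: image["CreationDate"])
print(latest["ImageId"], latest["Name"])
```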

Ah, at least it is a public AMI, so I should be able to fix it even though I don't have access.

mikemhenry avatar Aug 09 '24 16:08 mikemhenry

okay test here https://github.com/OpenFreeEnergy/openfe/actions/runs/10322863477

mikemhenry avatar Aug 09 '24 16:08 mikemhenry

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10324157429/job/28583139693

Didn't have the shell set up correctly.

mikemhenry avatar Aug 09 '24 18:08 mikemhenry

@IAlibay Do we want to do the integration tests as well? I was thinking of at least turning them on so we can get some timing data (after the current test finishes).

mikemhenry avatar Aug 09 '24 19:08 mikemhenry

Sweet, it ran in 1h 27m 20s, so one minute longer than it took on a CPU-only runner, but at least now we are checking our GPU code paths for ~75 cents.

mikemhenry avatar Aug 09 '24 20:08 mikemhenry

I am going to run it again but this time do our integration tests as well to see how long they take

mikemhenry avatar Aug 09 '24 20:08 mikemhenry

longer tests running here https://github.com/OpenFreeEnergy/openfe/actions/runs/10326372977

mikemhenry avatar Aug 09 '24 22:08 mikemhenry

The integration tests are really the ones that have any kind of GPU testing. In many ways we might want to only run those.

IAlibay avatar Aug 09 '24 22:08 IAlibay

Yeah, I was thinking about whether it's worth running only the GPU tests or all the tests. We might have to tweak our tests to support that option, and it would save us some money, but if it took an hour to do, how many CI runs would we need to break even?

mikemhenry avatar Aug 09 '24 22:08 mikemhenry
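The break-even question can be made concrete with a quick estimate. Only the ~$1.06/run figure quoted later in this thread comes from the discussion; the developer rate, effort, and GPU-only cost below are assumptions for illustration.

```python
# Rough break-even estimate for splitting out GPU-only tests.
# Only the $1.06/run figure comes from this thread; everything else
# is an assumed number for illustration.
full_suite_cost = 1.06     # $ per run on the GPU runner (quoted below)
gpu_only_cost = 0.40       # assumed cost if only GPU tests were run
savings_per_run = full_suite_cost - gpu_only_cost

dev_hourly_rate = 75.0     # assumed $/hour of developer time
hours_to_implement = 1.0   # assumed effort to split the tests

runs_to_break_even = (dev_hourly_rate * hours_to_implement) / savings_per_run
print(f"~{runs_to_break_even:.0f} CI runs to pay back the refactor")
```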

Long tests ran in 1h 53m 39s

 ============================= slowest 10 durations =============================
1899.24s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
916.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
842.12s call     openfe/tests/protocols/test_openmmutils.py::TestOFFPartialCharge::test_am1bcc_conformer_nochange
824.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
762.52s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
712.34s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[independent]
658.25s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
630.85s setup    openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::TestTyk2XmlRegression::test_particles
555.15s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
537.59s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
= 861 passed, 37 skipped, 1 xfailed, 2 xpassed, 5557 warnings in 6553.91s (1:49:13) =

It looks like most of the skips were from missing openeye and espaloma; we should probably install those packages so we can run everything.

Thoughts, @IAlibay?

mikemhenry avatar Aug 12 '24 16:08 mikemhenry
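For context, skips like these typically come from availability guards around optional dependencies, which is why installing the packages makes the tests run. A minimal sketch of the usual pattern (an assumed illustration, not openfe's actual test code):

```python
import pytest

# Assumed illustration of an optional-dependency guard: the test is
# skipped when the OpenEye toolkit is missing or unlicensed, so
# installing (and licensing) the package lets it run.
try:
    from openeye import oechem
    HAS_OPENEYE = oechem.OEChemIsLicensed()
except ImportError:
    HAS_OPENEYE = False

requires_openeye = pytest.mark.skipif(
    not HAS_OPENEYE, reason="requires a licensed OpenEye toolkit"
)

@requires_openeye
def test_needs_openeye_charges():
    ...
```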

Added openeye and espaloma to the env, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390261526

mikemhenry avatar Aug 14 '24 15:08 mikemhenry

typo in channel name, running test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390460059

mikemhenry avatar Aug 14 '24 15:08 mikemhenry

Ran into this issue:

  critical libmamba Multiple errors occured:
      Download error (28) Timeout was reached [https://conda.anaconda.org/jaimergp/label/unsupported-cudatoolkit-shim/noarch/repodata.json.zst]
      Operation too slow. Less than 30 bytes/sec transferred the last 60 seconds
      Subdir jaimergp/label/unsupported-cudatoolkit-shim/noarch not loaded!

I removed the channel from our env.yaml; I will check to make sure it doesn't cause a large download in the other jobs.

New test here: https://github.com/OpenFreeEnergy/openfe/actions/runs/10390626657

mikemhenry avatar Aug 14 '24 15:08 mikemhenry

Some tests are skipped:

[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip
openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry
[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_dict_roundtrip_clear_registry
openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip
[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_keyed_dict_roundtrip
openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip
[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_to_shallow_dict_roundtrip
openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable
[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_key_stable
openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr
[gw1] [ 88%] SKIPPED openfe/tests/protocols/test_rfe_tokenization.py::TestRelativeHybridTopologyProtocolResult::test_repr

Will have to look into that, but it took about 2 hours to run:

============================= slowest 10 durations =============================
2023.28s call     openfe/tests/protocols/test_openmm_rfe_slow.py::test_run_eg5_sim
883.92s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_aniline_mapping-0-1-False-11-4-1]
740.35s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0-0-True-14-1-4]
696.44s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[sams]
691.12s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[aniline_to_benzene_mapping-0--1-False-11-1-4]
681.27s call     openfe/tests/protocols/test_openmm_afe_slow.py::test_openmm_run_engine[CPU]
651.47s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0-0-True-14-3-1]
577.98s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex[repex]
538.88s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzoic_to_benzene_mapping-0-0-True-14-1-3]
531.53s call     openfe/tests/protocols/test_openmm_equil_rfe_protocols.py::test_dry_run_complex_alchemwater_totcharge[benzene_to_benzoic_mapping-0--1-False-11-3-1]
= 872 passed, 24 skipped, 4 xfailed, 5 xpassed, 5718 warnings in 6914.35s (1:55:14) =

I think now we just need to decide how often we want to schedule this run and it will be good to merge!

mikemhenry avatar Aug 15 '24 23:08 mikemhenry

@IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run

mikemhenry avatar Aug 24 '24 01:08 mikemhenry

> @IAlibay any thoughts on how often we want to run this? It costs us ~$1.06 per run

I think we need to have a chat / re-organize things: we're running a ton of CPU-only tests on a GPU runner, which seems like a bit of a waste (especially if we're doing this regularly).

We probably want to move GPU-only tests to a specific set of test files and only run those (e.g. the rfe slow tests).

IAlibay avatar Aug 26 '24 16:08 IAlibay
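One way to do that split (a sketch only; openfe's test suite may organise this differently) is a custom pytest marker, so the GPU runner selects just the GPU-exercising tests.

```python
# conftest.py -- sketch of a hypothetical "gpu" marker; the marker name
# and the decorated test below are illustrative, not openfe's actual layout.
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "gpu: tests that exercise GPU code paths"
    )

# In a test module, GPU-dependent tests would be tagged like this:
@pytest.mark.gpu
def test_runs_on_cuda_platform():
    ...

# The AWS GPU runner would then invoke only those tests, e.g.
#   pytest -m gpu openfe/tests/
# while the regular CPU runners keep using -m "not gpu".
```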