gatk
gatk copied to clipboard
Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment.
Copying over some discussion from Slack, with some slight modifications:
I took a quick stab at updating the environment for gCNV. Even taking out TensorFlow (assuming that the CNN will not be supported by this environment), it's a difficult task:
- The goal is to update Python from 3.6 to 3.10+, since Terra now requires the latter for officially supported images.
- However, gCNV relies on the PyMC3 package. PyMC3 3.1 is currently used in GATK master. 3.1 was released in 2017, not long before our release of gCNV in 2018, but it's very old now.
- The latest version of Python that is supported by PyMC3 3.1 in conda is Python 3.6.
- @asmirnov239 has a draft PR (#8094) that updates PyMC3 to 3.5 and Python to 3.7, which clearly still falls short of Python 3.10+. This PR also updated some gCNV code to make it compatible with PyMC3 3.5. (It also removed TensorFlow and added PyTorch.)
- @asmirnov239 also merged a PR that added tests for numerical reproducibility of GermlineCNVCaller in cohort mode in #7889.
- The earliest version of PyMC that supports Python 3.10+ is PyMC 4, released in 2022.
- However, PyMC 4 introduces API changes, which will also require additional gCNV code changes and numerical testing.
- These API changes are because the underlying computational backend for PyMC was updated from Theano (think of this as an old alternative to TensorFlow) to Aesara.
- Since then, PyMC 5.9 has been released and the underlying backend has been updated again, from Aesara to PyTensor.
- So if we are going to update the environment to support Python 3.10+, it probably makes sense to go all the way to PyMC 5.9.
I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!
However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (Also note that I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. Also noting here that I think determinism is still currently broken as of this commit---there have been some changes to PyTensor/PyMC seeding so that our previous theano/PyMC3 hack no longer applies.)
So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here----1kGP?
I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.
Github actions tests reported job failures from actions build 6618531552 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 6618531552.3 | logs |
Github actions tests reported job failures from actions build 6619431291 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 6619431291.3 | logs |
Github actions tests reported job failures from actions build 6619473068 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 6619473068.3 | logs |
Thanks for your work on this @samuelklee! Testing on both wes and wgs would be ideal. For wgs we can use the gatk-sv reference panel, which is our standard (I can help with this once a docker is ready). For wes, 1kgp would work although it's definitely showing its age. Are the integration test differences large?
Github actions tests reported job failures from actions build 6818599936 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 6818599936.3 | logs |
Github actions tests reported job failures from actions build 7109829354 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7109829354.3 | logs |
Github actions tests reported job failures from actions build 7114452949 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7114452949.3 | logs |
Github actions tests reported job failures from actions build 7116896802 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7116896802.3 | logs |
Github actions tests reported job failures from actions build 7116936117 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7116936117.3 | logs |
Github actions tests reported job failures from actions build 7132752076 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7132752076.3 | logs |
Github actions tests reported job failures from actions build 7133711844 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7133711844.3 | logs |
Github actions tests reported job failures from actions build 7135802130 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7135802130.3 | logs |
Github actions tests reported job failures from actions build 7137014657 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7137014657.3 | logs |
Github actions tests reported job failures from actions build 7142206385 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7142206385.3 | logs |
Github actions tests reported job failures from actions build 7142518880 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7142518880.3 | logs |
Github actions tests reported job failures from actions build 7142606113 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7142606113.3 | logs |
Github actions tests reported job failures from actions build 7142661462 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7142661462.3 | logs |
Github actions tests reported job failures from actions build 7142791118 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7142791118.3 | logs |
Github actions tests reported job failures from actions build 7143017940 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7143017940.3 | logs |
OK, I think things are looking good! Updated a bunch of things, including the following:
- conda 23.1.0 -> 23.10.0; in the base Docker, also disabled conda auto-updating and set the solver to the much faster libmamba (NOTE: before this PR went in, this change was actually made in https://github.com/broadinstitute/gatk/pull/8610)
- python 3.6.10 -> 3.10.13
- pymc 3.1 -> 5.10.0
- theano 1.0.4 -> pytensor 2.18.1
- added pytorch 2.1.0
- removed tensorflow 1.15.0 and other CNN dependencies
- added libblas-dev to the base Docker; I think MKL versions of all packages are being used, but we should verify!
These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages.
I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/Pytensor. For the most part, these were minor renamings of theano/tt/etc. to pytensor/pt/etc.
However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines.
Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors).
Some remaining TODOs:
- [x] Rebuild and push the base Docker. EDIT: Mostly covered by #8610, but this also includes an addition of
libblas-dev. - [x] Update expected results for integration tests, perhaps add any that might be missing. EDIT: These were generated on WSL Ubuntu 20.04.2, we'll see if things pass on 22.04. Note that changing the ARD priors does change the names of the expected files, since the transform is appended to the corresponding variable name. DetermineGermlineContigPloidy and PostprocessGermlineCNVCalls are missing exact-match tests and should probably have some, but I'll leave that to someone else.
- [x] Update other python integration tests.
- [x] Clean up some of the changes to the priors.
- [x] Clean up some TODO comments that I left to track code changes that might result in changed numerics. I'll try to go through and convert these to PR comments in an initial review pass.
- [x] Test over multiple shards on WGS and WES. Probably some scientific tests on ~100 samples in both cohort and case mode would do the trick. We should also double check runtime/memory performance (I noted ~1.5x speedups, but didn't measure carefully; I also want to make sure the changes to posterior sampling didn't introduce any memory issues). @mwalker174 will ping you when a Docker is ready! Might be good to loop in Isaac and/or Jack as well.
- [x] Perhaps add back the fix for 2-interval shards in https://github.com/broadinstitute/gatk/pull/8180, which I removed since the required functionality wasn't immediately available in Pytensor. Not sure if this actually broke things though---need to check. (However, I don't actually think this is a very important use case to support...)
- [x] Delete/deprecate/etc. CNN tools/tests as appropriate. Note that this has to be done concurrently, since we remove Tensorflow. @droazen perhaps I can take a first stab at this in a subsequent commit to this PR once more of the gCNV dust settles and/or has undergone a preliminary review? EDIT: Disabled integration/WDL tests. We should add some deprecation messages to the tools---we can note that they should still work in previous environments but will be untested. I might set up a separate PR for deletion, to be done at the appropriate time (but I call dibs on this, can't have @davidbenjamin overtaking my all-time record for number of lines deleted 😛).
Github actions tests reported job failures from actions build 7143821808 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7143821808.3 | logs |
Github actions tests reported job failures from actions build 7183220084 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7183220084.3 | logs |
Github actions tests reported job failures from actions build 7184026360 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| integration | 17.0.6+10 | 7184026360.11 | logs |
Github actions tests reported job failures from actions build 7186584675 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| integration | 17.0.6+10 | 7186584675.11 | logs |
Github actions tests reported job failures from actions build 7186790110 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| integration | 17.0.6+10 | 7186790110.11 | logs |
| integration | 17.0.6+10 | 7186790110.11 | logs |
Github actions tests reported job failures from actions build 7188996998 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7188996998.3 | logs |
Github actions tests reported job failures from actions build 7189321306 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7189321306.3 | logs |
Github actions tests reported job failures from actions build 7190799987 Failures in the following jobs:
| Test Type | JDK | Job ID | Logs |
|---|---|---|---|
| conda | 17.0.6+10 | 7190799987.3 | logs |
Gonna do some rebasing and splitting into commits, copying some bits from the current git log here for posterity:
commit 28174871c5b2e99a3044a30c0de24a63d2fee5bc (HEAD -> sl_python_version_update, origin/sl_python_version_update)
Author: Samuel Lee <[email protected]>
Date: Wed Dec 13 00:59:52 2023 -0500
update gCNV WDL tests
commit 31a204b9e900849b5a313e161893de34a2094bb0
Author: Samuel Lee <[email protected]>
Date: Wed Dec 13 00:14:34 2023 -0500
staged base rc1
commit 74f8fa724dfac142ccd7ac79a757c0e5ac3bb06c
Author: Samuel Lee <[email protected]>
Date: Wed Dec 13 00:01:38 2023 -0500
minor pymc/pytensor version upgrades, fix 2-interval edge case, update some theano docs
commit 9c9d0c570dd2712631739e0a9d41e90c4ccd3456
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 23:36:55 2023 -0500
update VETS expected, verbose conda env create, pin torch CPU MKL, add pysam, fixed more tests
commit c0a17dfcf9fa1139927570d2f16125bc15a2c19f
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 20:07:08 2023 -0500
fix CNV plotting
commit dd2dd503a92e6fbb5a49be6a88d2e813eb8bf85b
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 15:14:08 2023 -0500
update gCNV expected results, generated on WSL Ubuntu 20.04.2
commit 27d76e8f22d61df90eeb337e033ae128ce07ab90
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 14:53:04 2023 -0500
update python env integration tests
commit 348df9192235f7d1ea941d0b31e5c96acc0d6491
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 10:59:23 2023 -0500
disable CNN tests, add deprecation message
commit ed59372b4be226785af1d3fb1b1a39a9ad3b4f6a
Author: Samuel Lee <[email protected]>
Date: Tue Dec 12 09:55:24 2023 -0500
clean up rebase
commit 18e530db26f803ee46a0006843cb36d4ed4194b4
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 11:31:46 2023 -0500
postprocess fixed
commit f510c2e9f10d7066c15f1835669d676964b8a4cb
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 10:13:01 2023 -0500
fix deprecated np.int in optimizer
commit 939a032f356f2f8f67b5aae426fc427d1d1ea6c4
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:50:57 2023 -0500
remove unnecessary seeding in cohort denoising script
commit cf82ea5c99250f1784f8b1a9279e7dbb8841fa89
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:38:08 2023 -0500
add back setup.py files
commit 8348f546de6b3d32e1f02f6851730226c0dbffc9
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:37:09 2023 -0500
update pymc version in init
commit 850d60ef95b6126c05af9cd7c2cb528a306e1224
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:32:42 2023 -0500
added pip editable docs
commit 9c51b311442b0796ab1224213e83290caea0f93f
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:24:50 2023 -0500
whitespace
commit d9b180385168fdd1ef55cee8a1069fc1f7928f38
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:24:10 2023 -0500
update setup_gcnvkernel.py and pin pytensor
commit 7ccbd6da3d4afd1c987a66f6874bc4918495f943
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 09:20:49 2023 -0500
update conda in base and python packages
commit 693a1f9de10ee9950000abb83ef598cde82e026b
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 08:55:53 2023 -0500
clean up Mixture, sample seeding, safelog
commit 2b211d7ed7875c798020ceaeed865523f25c7096
Author: Samuel Lee <[email protected]>
Date: Fri Dec 8 00:28:54 2023 -0500
fixed all determinism, need to clean up seeds
commit 799228dd5fafa2bf5d57e65e9aab256cdfc4698a
Author: Samuel Lee <[email protected]>
Date: Thu Dec 7 22:22:18 2023 -0500
more dCR, Mixture
commit 124073d9af37c19573f90270c3cb0f4ae5ba4dd0
Author: Samuel Lee <[email protected]>
Date: Thu Dec 7 19:39:55 2023 -0500
fix dCR sampling?
commit f6871c9f12dbd52d84e78bba4f03d276ba1efb72
Author: Samuel Lee <[email protected]>
Date: Thu Dec 7 15:34:26 2023 -0500
logsumexp cleanup
commit 0c6cba790d8c3566a1a872d292e2b7acc692670e
Author: Samuel Lee <[email protected]>
Date: Thu Dec 7 13:57:16 2023 -0500
revert backport of MeanField
commit 67fb373687d32ec7e0fa329d4f3864e2cccabc53
Author: Samuel Lee <[email protected]>
Date: Wed Dec 6 10:45:11 2023 -0500
just used sample_node, finally deterministic?!
commit 754c424566cb9c191b9775b02d464488d7f68f68
Author: Samuel Lee <[email protected]>
Date: Wed Dec 6 07:22:51 2023 -0500
port stable logsumexp, though it doesn't seem to make a difference in ploidy
commit 95c944d792642ba5ec8dceaaff67d3a35cb3eab0
Author: Samuel Lee <[email protected]>
Date: Tue Dec 5 23:51:01 2023 -0500
working with Mixture, still nondeterministic
commit 426375ac6473999ba249e775f2aa7d622534d510
Author: Samuel Lee <[email protected]>
Date: Tue Dec 5 22:32:44 2023 -0500
stochastic node cleanup
commit dc66f3f6a07dc82ba4258b3c0a65d990355da8d7
Author: Samuel Lee <[email protected]>
Date: Tue Dec 5 21:38:52 2023 -0500
safe log in log ploidy priors
commit 9712fa169a08edcf1a7a56622708e610b862631e
Author: Samuel Lee <[email protected]>
Date: Tue Dec 5 21:35:05 2023 -0500
still debugging stochastic node
commit a79762c616f8758e0fd073b9e8e5d1ad30c1d5d0
Author: Samuel Lee <[email protected]>
Date: Thu Nov 9 16:48:05 2023 -0500
blas in base, conda with libmamba solver
commit b4f5301c28a03689fca0b95ef652a51dd991686d
Author: Samuel Lee <[email protected]>
Date: Thu Nov 9 12:45:53 2023 -0500
freeze conda and set libmamba in base
commit f57a13a08fc3d049f0271e9aa94639ecb87b50f2
Author: Samuel Lee <[email protected]>
Date: Thu Nov 9 07:29:30 2023 -0500
libmamba 23.9.0
commit 45058f27aae9c9240a167f126e32a6bddd3353ff
Author: Samuel Lee <[email protected]>
Date: Wed Nov 8 23:33:51 2023 -0500
conda 23.9.0
commit d95494d282dd42ea3377de6c2d3554a3a5db65e4
Author: Samuel Lee <[email protected]>
Date: Mon Oct 30 17:04:34 2023 -0400
minor fixes to get ploidy working
commit c6a21f33f1f17b2bc9b878d8d6bfc6d7da784933
Author: Samuel Lee <[email protected]>
Date: Mon Oct 30 16:33:22 2023 -0400
truncated normal fixes
commit 81e3fff0a099731730d1657fc12f56abdc2b2927
Author: Samuel Lee <[email protected]>
Date: Mon Oct 30 16:03:02 2023 -0400
minor clip fix
commit 2fb94e88ab457ec037d304f7a80384b70a8685cc
Author: Samuel Lee <[email protected]>
Date: Mon Oct 30 15:17:06 2023 -0400
use model.rvs_to_values for registry; inference completes but numerical tests seem off!
commit 5d2062f3301f682dbf427d898521ec6da8338944
Author: Samuel Lee <[email protected]>
Date: Mon Oct 30 15:03:40 2023 -0400
VarMap and transformed name in register, exact match error to warn
commit 3dca890329396c9afae52aa8779cdfb97c18fa7c
Author: Samuel Lee <[email protected]>
Date: Wed Oct 25 23:28:32 2023 -0400
inference works! oh wow
commit d056855608fdf222a38a25d8de41f11fa1e82c22
Author: Samuel Lee <[email protected]>
Date: Tue Oct 24 22:11:30 2023 -0400
past full denoising instantiation and through to first calling epoch before hitting another shape error
commit fcfdb29e79e0652d24a5f509ade6e941f84293c2
Author: Samuel Lee <[email protected]>
Date: Tue Oct 24 00:17:46 2023 -0400
revert pytensorf
commit 2ec79b362bdfcd34b3ca1736e6327ecbbe2269ad
Author: Samuel Lee <[email protected]>
Date: Tue Oct 24 00:16:04 2023 -0400
HalfFlat typo
commit 6eb1594d62ed62b992db04c523ffc598845d3b4c
Author: Samuel Lee <[email protected]>
Date: Mon Oct 23 23:46:42 2023 -0400
added TODOs for nontrivial changes
commit c01754082991c006c8d56f0027138948d792ce75
Author: Samuel Lee <[email protected]>
Date: Mon Oct 23 23:33:43 2023 -0400
warmup denoising inference runs, a minor miracle! probably incorrect after some nontrivial changes, though
commit 3bf1c5b65d12fe76e499d1b43e58736e31f25f33
Author: Samuel Lee <[email protected]>
Date: Mon Oct 23 17:13:04 2023 -0400
find and replace, some version bumps
commit 3039e1f01579003518200b9033c5674c3d1de156
Author: Samuel Lee <[email protected]>
Date: Mon Oct 23 14:30:32 2023 -0400
python 3.10.9, pymc3 3.11.2
commit 4a2c86391990fb01d2e67fc7cf823495eef1ba99
Author: Andrey Smirnov <[email protected]>
Date: Sun Oct 23 03:58:36 2022 +0000
Updates to gCNV code for pymc3 upgrades.
Hi @samuelklee,
Any updates on this PR? Will this be able to get merged in the foreseeable future?
Thanks M