Multimer predictions degraded after updating from 2.2.3 to 2.2.4?

Open lucajovine opened this issue 1 year ago • 30 comments

Hello, this is ranking_debug.json from a test multimer run that I had performed with version 2.2.3:

{ "iptm+ptm": { "model_1_multimer_v2_pred_0": 0.8584136076107545, "model_2_multimer_v2_pred_0": 0.8055674020453231, "model_3_multimer_v2_pred_0": 0.6845710872128711, "model_4_multimer_v2_pred_0": 0.8423123043127994, "model_5_multimer_v2_pred_0": 0.8676285262204777 }, "order": [ "model_5_multimer_v2_pred_0", "model_1_multimer_v2_pred_0", "model_4_multimer_v2_pred_0", "model_2_multimer_v2_pred_0", "model_3_multimer_v2_pred_0" ] }

and this is from an equivalent job, carried out using the same exact input, after updating to 2.2.4:

{ "iptm+ptm": { "model_1_multimer_v2_pred_0": 0.40774006137839586, "model_2_multimer_v2_pred_0": 0.44277279108467366, "model_3_multimer_v2_pred_0": 0.442618376771567, "model_4_multimer_v2_pred_0": 0.5631646773883024, "model_5_multimer_v2_pred_0": 0.48998768122405234 }, "order": [ "model_4_multimer_v2_pred_0", "model_5_multimer_v2_pred_0", "model_2_multimer_v2_pred_0", "model_3_multimer_v2_pred_0", "model_1_multimer_v2_pred_0" ] }

Is anyone else experiencing the same? I am not getting any obvious error upon running 2.2.4, so it's not clear to me why there should be such a significant difference...
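
(Side note: a minimal sketch for eyeballing two such runs side by side, assuming jq is available and the default output layout; the run directories and <target> name below are placeholders:)

    # Print the iptm+ptm block from each run's ranking_debug.json for comparison.
    jq '."iptm+ptm"' run_2.2.3/<target>/ranking_debug.json
    jq '."iptm+ptm"' run_2.2.4/<target>/ranking_debug.json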

lucajovine avatar Sep 22 '22 16:09 lucajovine

Hi thanks for raising this. Can you please provide more details of the test run so this can be reproduced?

Htomlinson14 avatar Sep 22 '22 16:09 Htomlinson14

Sure, of course. This was my input FASTA:

>protein1
IDWDVYCSQDESIPAKFISRLVTSKDQALEKTEINCSNGLVPITQEFGINMMLIQYTRNELLDSPGMCVFWGPYSVPKNDTVVLYTVTARLKWSEGPPTNLSIQCYMPK
>protein2
RSWHYVEPKFLNKAFEVALKVQIIAGFDRGLVKWLRVHGRTLSTVQKKALYFVNRRYMQTHWANYMLWINKKIDALGRTPVVGDYTRLGAEIGRRIDMAYFYDFLKDKNMIPKYLPYMEEINRMRPADVPVKYM

(which basically corresponds to our PDB deposition 5IIB)

Flags were --db_preset=full_dbs --model_preset=multimer --num_multimer_predictions_per_model=1 --gpu_devices="0" --max_template_date=3000-01-01
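
For reference, a hypothetical reconstruction of the full command those flags imply, via the Docker wrapper (the run was from the Docker image, as confirmed below); the --fasta_paths, --data_dir and --output_dir values are placeholders not given in the thread:

    python3 docker/run_docker.py \
        --fasta_paths=multimer_1.fasta \
        --data_dir=/path/to/alphafold_databases \
        --output_dir=/path/to/output \
        --db_preset=full_dbs \
        --model_preset=multimer \
        --num_multimer_predictions_per_model=1 \
        --gpu_devices="0" \
        --max_template_date=3000-01-01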

lucajovine avatar Sep 22 '22 16:09 lucajovine

Also...

> python run_alphafold.py --helpshort
/usr/local/conda/miniconda3/envs/af2/lib/python3.7/site-packages/haiku/_src/data_structures.py:37: FutureWarning: jax.tree_structure is deprecated, and will be removed in a future release. Use jax.tree_util.tree_structure instead.
  PyTreeDef = type(jax.tree_structure(None))
Traceback (most recent call last):
  File "run_alphafold.py", line 39, in <module>
    from alphafold.relax import relax
  File "/media/3p5TBssd1/usr/local/alphafold/alphafold/alphafold/relax/relax.py", line 18, in <module>
    from alphafold.relax import amber_minimize
  File "/media/3p5TBssd1/usr/local/alphafold/alphafold/alphafold/relax/amber_minimize.py", line 25, in <module>
    from alphafold.relax import cleanup
  File "/media/3p5TBssd1/usr/local/alphafold/alphafold/alphafold/relax/cleanup.py", line 22, in <module>
    import pdbfixer
ModuleNotFoundError: No module named 'pdbfixer'

I guess this could be automatically installed via conda?
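
A minimal sketch of the manual fix, assuming a non-Docker setup with the af2 conda environment active (pdbfixer is packaged on conda-forge rather than PyPI):

    # Install pdbfixer from conda-forge into the active environment.
    conda install -y -c conda-forge pdbfixer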

lucajovine avatar Sep 22 '22 22:09 lucajovine

Just my 2 cents: I don't have the issue with

    --use_gpu_relax=true \
    --model_preset=multimer \
    --max_template_date=3000-01-01 \
    --db_preset=full_dbs

Pasted sorted values (2.2.4 on the left, 2.2.3 on the right):

       "model_5_multimer_v2_pred_1": 0.8600176497731321,               "model_2_multimer_v2_pred_4": 0.8423162898031145,
       "model_4_multimer_v2_pred_1": 0.8355726960868132,               "model_2_multimer_v2_pred_2": 0.8345795690267899,
       "model_5_multimer_v2_pred_0": 0.8106497232061523,               "model_4_multimer_v2_pred_4": 0.8274770532116257,
       "model_5_multimer_v2_pred_4": 0.7618542529543955                "model_5_multimer_v2_pred_1": 0.8100580241139267,
       "model_1_multimer_v2_pred_2": 0.7462422015606857,               "model_1_multimer_v2_pred_1": 0.7556708441527138,
       "model_5_multimer_v2_pred_3": 0.7262914654950798,               "model_1_multimer_v2_pred_2": 0.7530240249541638,
       "model_4_multimer_v2_pred_2": 0.6850626261632934,               "model_4_multimer_v2_pred_0": 0.7453822555113954,
       "model_1_multimer_v2_pred_4": 0.6504379081135182,               "model_1_multimer_v2_pred_3": 0.7074747369853926,
       "model_4_multimer_v2_pred_3": 0.6282972043309893,               "model_4_multimer_v2_pred_2": 0.6217676770784255,
       "model_4_multimer_v2_pred_0": 0.5659582021061171,               "model_5_multimer_v2_pred_2": 0.5892705180579751,
       "model_1_multimer_v2_pred_3": 0.5530272925711416,               "model_1_multimer_v2_pred_4": 0.5886967664945151,
       "model_1_multimer_v2_pred_1": 0.550091036479859,                "model_4_multimer_v2_pred_1": 0.5816013102240326,
       "model_2_multimer_v2_pred_0": 0.537335888385569,                "model_2_multimer_v2_pred_0": 0.517386942967343,
       "model_2_multimer_v2_pred_2": 0.534451564803337,                "model_1_multimer_v2_pred_0": 0.5026680309406077,
       "model_4_multimer_v2_pred_4": 0.5258449296466584,               "model_4_multimer_v2_pred_3": 0.48747237933686044,
       "model_2_multimer_v2_pred_3": 0.515407234197668,                "model_5_multimer_v2_pred_3": 0.46593239818276644,
       "model_3_multimer_v2_pred_1": 0.499675159211949,                "model_3_multimer_v2_pred_2": 0.43583198796278766,
       "model_2_multimer_v2_pred_4": 0.4734043448923909,               "model_2_multimer_v2_pred_1": 0.42842107067582047,
       "model_3_multimer_v2_pred_0": 0.46022060251713326,              "model_5_multimer_v2_pred_0": 0.4262880199002601,
       "model_1_multimer_v2_pred_0": 0.4401072457868955,               "model_3_multimer_v2_pred_1": 0.42030398029562555,
       "model_5_multimer_v2_pred_2": 0.4233857428348928,               "model_3_multimer_v2_pred_4": 0.41386853973521476,
       "model_3_multimer_v2_pred_4": 0.4185971424788965,               "model_3_multimer_v2_pred_0": 0.4004228507119986,
       "model_3_multimer_v2_pred_3": 0.4112920469041982,               "model_2_multimer_v2_pred_3": 0.381738114315064,
       "model_3_multimer_v2_pred_2": 0.3990563873655274,               "model_3_multimer_v2_pred_3": 0.3696506974209459,
       "model_2_multimer_v2_pred_1": 0.3888554107330705,               "model_5_multimer_v2_pred_4": 0.35258544121857827

truatpasteurdotfr avatar Sep 23 '22 12:09 truatpasteurdotfr

Thanks @lucajovine. Are you running this with the docker image? It should have all of the prerequisites installed. pdbfixer isn't available on PyPI, so it needs to be installed with conda.

Thanks @truatpasteurdotfr we also didn't see any issues in internal testing.

Htomlinson14 avatar Sep 23 '22 12:09 Htomlinson14

Hi, thank you - I am re-installing 2.2.4 from scratch rather than pulling it into my existing installation (although that always worked fine in the past, so maybe something odd happened this time). I will re-run the test after that and let you know what happens! @Htomlinson14: yes, this was from the docker image.

lucajovine avatar Sep 23 '22 12:09 lucajovine

Hello again. I reinstalled 2.2.3 and 2.2.4 from scratch on two different machines and re-ran 4 test jobs on each (two different monomers and two different multimers, with multimer_1 being the same as above). I am afraid the problem persists, even though (1) it does not seem to affect monomers and (2) it is much more evident for one multimer than the other:

monomer_1
	2.2.3
		"model_1_pred_0": 92.51882397715578,
		"model_2_pred_0": 92.2056299645637,
		"model_3_pred_0": 92.03138974894895,
		"model_4_pred_0": 93.1682804560119,
		"model_5_pred_0": 93.65241384700947
	2.2.4
		"model_1_pred_0": 92.53758910826187,
		"model_2_pred_0": 92.94215515721487,
		"model_3_pred_0": 93.52598782948384,
		"model_4_pred_0": 93.42160017797602,
		"model_5_pred_0": 93.70445135155603

monomer_2
	2.2.3
		"model_1_pred_0": 85.85987502475369,
		"model_2_pred_0": 84.90289928449474,
		"model_3_pred_0": 86.95139195928073,
		"model_4_pred_0": 86.64831516433784,
		"model_5_pred_0": 86.92184064848513
	2.2.4
		"model_1_pred_0": 85.94278137586902,
		"model_2_pred_0": 84.20654974770405,
		"model_3_pred_0": 86.83553028330901,
		"model_4_pred_0": 87.14120498958022,
		"model_5_pred_0": 87.1268406084284

multimer_1
	2.2.3
		"model_1_multimer_v2_pred_0": 0.8501568830419893,
		"model_2_multimer_v2_pred_0": 0.8486699468672132,
		"model_3_multimer_v2_pred_0": 0.3229293496900477,
		"model_4_multimer_v2_pred_0": 0.8455477128109672,
		"model_5_multimer_v2_pred_0": 0.852226493932774	
	2.2.4
		"model_1_multimer_v2_pred_0": 0.5117391870226631,
		"model_2_multimer_v2_pred_0": 0.4098071256975262,
		"model_3_multimer_v2_pred_0": 0.43044599005298734,
		"model_4_multimer_v2_pred_0": 0.5445095546638219,
		"model_5_multimer_v2_pred_0": 0.5859474538776686

multimer_2
	2.2.3
		"model_1_multimer_v2_pred_0": 0.8336408586490794,
		"model_2_multimer_v2_pred_0": 0.8327994809055634,
		"model_3_multimer_v2_pred_0": 0.8024798134822844,
		"model_4_multimer_v2_pred_0": 0.811666152078712,
		"model_5_multimer_v2_pred_0": 0.8072626282045929
	2.2.4
		"model_1_multimer_v2_pred_0": 0.7638937624852511,
		"model_2_multimer_v2_pred_0": 0.7730047759497934,
		"model_3_multimer_v2_pred_0": 0.8002452325904037,
		"model_4_multimer_v2_pred_0": 0.7938463939736111,
		"model_5_multimer_v2_pred_0": 0.7926498563280233

I have not done such careful comparisons before, but I always ran a couple of tests after updating to new versions and - although of course some minor variability is normal - I do not recall seeing differences of the kind observed for multimer_1...

lucajovine avatar Sep 25 '22 09:09 lucajovine

Hi thanks very much for this. I will investigate and get back to you.

Htomlinson14 avatar Sep 26 '22 10:09 Htomlinson14

Hi again, I'm not seeing the same issue with the FASTA sequence you provided. Below are the results I obtained:

2.2.3
        "model_1_multimer_v2_pred_0": 0.7026557510461339,
        "model_2_multimer_v2_pred_0": 0.4509017402688641,
        "model_3_multimer_v2_pred_0": 0.380578570853769,
        "model_4_multimer_v2_pred_0": 0.6910764110918439,
        "model_5_multimer_v2_pred_0": 0.7656100030107779

2.2.4
        "model_1_multimer_v2_pred_0": 0.6930976486128585,
        "model_2_multimer_v2_pred_0": 0.840286483893667,
        "model_3_multimer_v2_pred_0": 0.36765421415043464,
        "model_4_multimer_v2_pred_0": 0.6591293362123893,
        "model_5_multimer_v2_pred_0": 0.8106014168732179

What hardware are you using?

Htomlinson14 avatar Sep 27 '22 14:09 Htomlinson14

Hi, both jobs were run on PCs running Ubuntu 20.04, one with a Quadro RTX 5000 (2.2.3) and the other with a GeForce RTX 2070 (2.2.4).

lucajovine avatar Sep 27 '22 14:09 lucajovine

Hi thanks for this. Would it be possible to reverse the experiment?

Also, what CUDA versions are on these machines? (run nvidia-smi)
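
(For completeness, a quick way to report both numbers; nvcc is only present if the CUDA toolkit itself is installed:)

    nvidia-smi       # driver version and the highest CUDA version the driver supports
    nvcc --version   # version of the locally installed CUDA toolkit, if any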

Htomlinson14 avatar Sep 27 '22 15:09 Htomlinson14

2.2.4 (2022-09-22), with data freshly re-downloaded, on a DGX A100 (driver 515.65.01), using the Singularity image built from the Docker image docker://ghcr.io/truatpasteurdotfr/alphafold:main; compared against the earlier version, 2.2.3 (2022-09-13):

tru@myrdal:~/alphafold$ head -n 8 alphafold-2022-09-13-1905-data-2.2.4-20220923-num_multimer_predictions_per_model_1-IALsG/issue-587/ranking_debug.json
{
    "iptm+ptm": {
        "model_1_multimer_v2_pred_0": 0.8568935778701908,
        "model_2_multimer_v2_pred_0": 0.8499162052690372,
        "model_3_multimer_v2_pred_0": 0.2961016070831388,
        "model_4_multimer_v2_pred_0": 0.8552748738237211,
        "model_5_multimer_v2_pred_0": 0.851201296704003
    },
tru@myrdal:~/alphafold$ head -n 8 alphafold-2022-09-22-2049-data-2.2.4-20220923-num_multimer_predictions_per_model_1-OCW7T/issue-587/ranking_debug.json 
{
    "iptm+ptm": {
        "model_1_multimer_v2_pred_0": 0.8550884037892792,
        "model_2_multimer_v2_pred_0": 0.39117327804768387,
        "model_3_multimer_v2_pred_0": 0.6731524796823938,
        "model_4_multimer_v2_pred_0": 0.8525504678600079,
        "model_5_multimer_v2_pred_0": 0.8554723246526093
    },

truatpasteurdotfr avatar Sep 27 '22 16:09 truatpasteurdotfr

On ColabFold, we've been using the latest version of jax. One user reported getting different results depending on which GPU they got. They report that the A100 gives different results compared to the V100 or T4.

sokrypton avatar Sep 27 '22 23:09 sokrypton

Good point @sokrypton: my first comment was going to be exactly that I was not aware that the type of GPU one uses makes any difference to the actual quality of an inference - other than, of course, needing enough VRAM to run the job in the first place. But if it does, that would be a crucial piece of information.

@Htomlinson14: yes of course I can reverse the experiment. But since one of the machines is currently running other jobs using 2.2.3, is it fine if on that one I just create a new conda environment to build the 2.2.4 docker image and then run the reverse test, or could the slightly different setup introduce further issues? As for CUDA, both machines have 11.5.

lucajovine avatar Sep 28 '22 02:09 lucajovine

Thanks for raising this. For now, if you are using an affected GPU we recommend pinning to previous versions of the repo. We will investigate the issue further.
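
A minimal sketch of the pinning step, assuming the previous release is tagged v2.2.3 and the image is built as in the README; the image tag name is arbitrary:

    # Check out the previous release and build it under a distinct image name,
    # so the pinned and current versions can coexist on the same machine.
    git checkout v2.2.3
    docker build -f docker/Dockerfile -t alphafold:2.2.3 .
    # then run with: python3 docker/run_docker.py --docker_image_name=alphafold:2.2.3 ...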

Htomlinson14 avatar Sep 28 '22 13:09 Htomlinson14

OK sure, looking forward to hearing what you find...

lucajovine avatar Sep 28 '22 14:09 lucajovine

Sorry I couldn't follow. Which GPUs are affected now?

cihanerkut avatar Sep 28 '22 14:09 cihanerkut

I'm beginning to suspect the old jax library was more consistent across GPUs, while the new jax library is not. Unfortunately, I don't have access to any high-end GPUs (like A100), so I can't do any tests on my side.

I think one informative test to do would be: run on 2.2.3 with A100 and non-A100, then repeat the same experiment with 2.2.4 on A100 and non-A100.

sokrypton avatar Sep 28 '22 14:09 sokrypton

Same here, but if DeepMind/Alphabet wants to indefinitely lend us one to test out things I will not say no! ;-)

lucajovine avatar Sep 28 '22 15:09 lucajovine

I think DeepMind/Alphabet only uses TPUs :P

sokrypton avatar Sep 28 '22 15:09 sokrypton

It's OK, I'll take anything!!

lucajovine avatar Sep 28 '22 15:09 lucajovine

I will try this on a dual-A40 server where I have enough free space to download the data; not sure I can do that on my workstation with an RTX 2080...

truatpasteurdotfr avatar Sep 28 '22 15:09 truatpasteurdotfr

A40 (48GB), Driver Version: 510.54

    "iptm+ptm": {
        "model_1_multimer_v2_pred_0": 0.8501128109383987,
        "model_2_multimer_v2_pred_0": 0.8540356829322757,
        "model_3_multimer_v2_pred_0": 0.6771353362331958,
        "model_4_multimer_v2_pred_0": 0.8452873344090073,
        "model_5_multimer_v2_pred_0": 0.8578954696460785
    },

RTX-2080ti (11GB), Driver Version: 515.65.01

    "iptm+ptm": {
        "model_1_multimer_v2_pred_0": 0.8535170484093717,
        "model_2_multimer_v2_pred_0": 0.8481280255454735,
        "model_3_multimer_v2_pred_0": 0.7481300076762746,
        "model_4_multimer_v2_pred_0": 0.8371862945478382,
        "model_5_multimer_v2_pred_0": 0.8328140753338644
    },

I will try to get access to a V100 and an rtx1080 machine.

truatpasteurdotfr avatar Sep 29 '22 13:09 truatpasteurdotfr

@truatpasteurdotfr interesting! Can you repeat this with v2.2.3, to see whether the outputs are more consistent between GPUs with the original pinned version of jax?

sokrypton avatar Sep 30 '22 00:09 sokrypton

In case it is useful, here are results for the test posted in this thread, from v2.2.4 + Cray + A100 + NVIDIA kernel driver 510.85.02 + cudatoolkit 11.7.0 + Charliecloud in place of Docker + databases downloaded on Sept 23, 2022, with the 4 offending entries matching ^CT05 removed from pdb_seqres/pdb_seqres.txt:

{ "iptm+ptm": { "model_1_multimer_v2_pred_0": 0.8440023053565382, "model_2_multimer_v2_pred_0": 0.8485773361074703, "model_3_multimer_v2_pred_0": 0.26827771102715653, "model_4_multimer_v2_pred_0": 0.8483706916239296, "model_5_multimer_v2_pred_0": 0.8648632996723908 }, "order": [ "model_5_multimer_v2_pred_0", "model_2_multimer_v2_pred_0", "model_4_multimer_v2_pred_0", "model_1_multimer_v2_pred_0", "model_3_multimer_v2_pred_0" ] }

Visualizing them, I see that model 3 has a chain B Arg56 that adopts a different chi3 rotamer, which would sterically clash with chain A (probably at Glu46) if chain A adopted the same relative orientation that it does in the other 4 models. I didn't read up on the methods used to do multimers, but if the side chain rotamers get locked in before the packing is evaluated, or there are other similar things happening that are sensitive to the order of events, then this system may be subject to substantial statistical noise.

therealchrisneale avatar Sep 30 '22 18:09 therealchrisneale

rtx 1080ti (Driver Version: 510.54): out of memory, as it cannot expand into the CPU RAM...

2022-10-01 07:07:58.693069: W external/org_tensorflow/tensorflow/compiler/xla/service/platform_util.cc:190] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11721179136

truatpasteurdotfr avatar Oct 01 '22 07:10 truatpasteurdotfr

Hi all. We have run AF multimer on several different GPUs and are not seeing any significant differences between predictions. @lucajovine would you mind checking to see if there are any differences in the MSAs or templates, on the two machines you are using? Thanks!
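
(A minimal sketch of that check, assuming both machines used the default output layout; <target> and the run directory names are placeholders:)

    # MSAs are written under <output_dir>/<target>/msas/ and the assembled
    # features under <output_dir>/<target>/features.pkl.
    diff -rq machine_2.2.3/<target>/msas machine_2.2.4/<target>/msas
    md5sum machine_2.2.3/<target>/features.pkl machine_2.2.4/<target>/features.pkl
    # Identical hashes imply identical inputs to the model; differing hashes
    # would call for a field-by-field comparison of the two pickles.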

Htomlinson14 avatar Oct 12 '22 15:10 Htomlinson14

@Htomlinson14 There should not be any, because the databases used by one machine were a clone of those used by the other...

lucajovine avatar Oct 12 '22 17:10 lucajovine

Thanks -- sounds like it's not a data issue. It's been difficult to reproduce internally and we have tried several different GPUs. We will continue to look into this, but until then we would recommend using v2.2.3, which has no difference in features. It seems likely that jax numerics have changed between jax versions on your hardware. One way to check this would be to change the jax and jaxlib versions, rebuild the Docker image, and rerun the model. One could do this at various jax versions to diagnose whether this is the issue.
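
A minimal sketch of that diagnosis, with placeholder version strings; the actual jax/jaxlib pins for each tag should be read from the two versions of the Dockerfile and requirements file, and the tags are assumed to be v2.2.3/v2.2.4:

    # See which dependency pins changed between the two tags.
    git diff v2.2.3 v2.2.4 -- docker/Dockerfile requirements.txt
    # Edit the jax/jaxlib pins in the v2.2.4 tree back to the v2.2.3 values
    # (e.g. jax==<old pin>, jaxlib==<old pin>+cuda11.cudnnXXX), then rebuild
    # under a distinct name and rerun the same FASTA:
    docker build -f docker/Dockerfile -t alphafold:jaxtest .
    python3 docker/run_docker.py --docker_image_name=alphafold:jaxtest \
        --fasta_paths=multimer_1.fasta --model_preset=multimer \
        --db_preset=full_dbs --max_template_date=3000-01-01 \
        --data_dir=/path/to/alphafold_databases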

Htomlinson14 avatar Oct 14 '22 16:10 Htomlinson14

OK thanks, I will stick to 2.2.3 for now and - as far as other people's work on the same machine allows - keep the check you suggested in mind.

lucajovine avatar Oct 19 '22 04:10 lucajovine