Memory usage with different dap_size
Hi, I tried different dap_size values, such as dap_size=2 and dap_size=4, but as dap_size increases the decrease in GPU memory is not obvious. Have you tried this?
According to our experimental results, the effect of DAP is significant. Could you provide more details so that we can reproduce the issue, such as your specific experimental setup and final results?
Running perf.py, I recorded the memory occupancy under different DAP sizes: dap=1 → 61%, dap=2 → 37%, dap=4 → 23%. In my opinion, ideally the memory should halve each time dap goes from 1 to 2 to 4. What is causing the memory drop to be less than that?
I think this result is reasonable. Although DAP slices most of the activations, in practice the theoretical linear reduction cannot be obtained, because the model weights and some parts of the activations are not partitionable.
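For intuition, here is a tiny back-of-the-envelope sketch of that explanation; the 25%/75% split between non-partitionable and DAP-sliced memory is purely illustrative, not a number measured with perf.py.

```python
# Back-of-the-envelope model: only part of the footprint is sharded by DAP.
def estimated_memory(dap_size: int, fixed: float, sharded: float) -> float:
    """Estimated footprint relative to dap=1, given a fixed/sharded split."""
    return fixed + sharded / dap_size

# Illustrative split: ~25% fixed (weights, unpartitioned activations, buffers),
# ~75% sliced by DAP. These fractions are assumptions, not measurements.
fixed, sharded = 0.25, 0.75
for dap in (1, 2, 4):
    print(f"dap={dap}: {estimated_memory(dap, fixed, sharded):.2f} of baseline")
# dap=1: 1.00, dap=2: 0.62, dap=4: 0.44 -- sub-linear, like the numbers above.
```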
Another question: will DAP bring a larger memory drop than using activation checkpointing? Have you done any comparison experiments?
DAP and activation checkpointing are orthogonal techniques and can be used together. Further memory reduction can be obtained by using DAP on top of checkpointing.
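As a rough illustration of how the two combine, here is a minimal sketch using torch.utils.checkpoint; `forward_blocks`, the `blocks` argument, and the two activation tensors are placeholders for an Evoformer-style stack and do not reflect FastFold's actual module interface.

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_blocks(blocks, msa_act: torch.Tensor, pair_act: torch.Tensor):
    """Run a stack of Evoformer-style blocks with activation checkpointing
    on inputs that have already been sliced by DAP along one axial dim."""
    for block in blocks:
        # Checkpointing drops intermediate activations and recomputes them in
        # the backward pass; DAP has already shrunk msa_act / pair_act per
        # rank, so the two savings stack.
        msa_act, pair_act = checkpoint(block, msa_act, pair_act, use_reentrant=False)
    return msa_act, pair_act
```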
I encountered a very strange phenomenon: I applied DAP to OpenFold, but when dap goes from 1 to 2, the memory did not decrease but increased.
What is the length of the amino acid sequence used in your test? As we mentioned in our paper, we recommend using DAP for distributed inference only when the length is greater than 1k, because the communication overhead (in both time and memory) is more significant when the sequence is short.
I just used 256 and 384. Do you mean it's caused by the communication overhead?
Possible reasons why DAP does not work well on short sequences: 1) DAP can only reduce the memory needed for intermediate activations, and when the sequence is not long enough this part of the memory footprint is relatively small. 2) The implementation of DAP requires additional buffers.
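To put some rough numbers on reason 1), here is an illustrative estimate of how a single pair-representation activation grows with sequence length; the channel width (128) and fp32 storage are assumptions for illustration only, and the real Evoformer keeps many such tensors.

```python
def pair_activation_mb(seq_len: int, channels: int = 128, bytes_per_el: int = 4) -> float:
    """Size of one L x L x C fp32 pair activation, in MiB (illustrative)."""
    return seq_len * seq_len * channels * bytes_per_el / 2**20

for L in (256, 384, 1024):
    print(f"L={L}: ~{pair_activation_mb(L):.0f} MiB per pair tensor")
# L=256: ~32 MiB, L=384: ~72 MiB, L=1024: ~512 MiB -- the L^2 term only starts
# to dominate the weights and buffers well past a few hundred residues.
```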
- How should I understand that DAP requires additional buffers? Did your paper mention it?
You can refer, for example, to these lines: https://github.com/hpcaitech/FastFold/blob/main/fastfold/distributed/comm.py#L56-L58
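For readers who don't want to open the link, the following is a hedged sketch of a gather along the DAP dimension; it mirrors the general pattern (allocate an output buffer, all-gather into it) rather than FastFold's exact code.

```python
import torch
import torch.distributed as dist

def gather_along_dim(local: torch.Tensor, dim: int, group=None) -> torch.Tensor:
    """All-gather DAP shards along `dim`; note the extra output buffer."""
    world_size = dist.get_world_size(group)
    # The gathered result needs a buffer world_size times the local shard,
    # which is memory on top of the activations themselves.
    chunks = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(chunks, local.contiguous(), group=group)
    return torch.cat(chunks, dim=dim)
```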
In the paper, scalability is presented as the reason, since it is the more fundamental reason for not using DAP on short sequences.