Memory usage with different dap_size
Hi, I tried different dap_size values, such as dap_size=2 and dap_size=4, but as dap_size increases the decrease in GPU memory is not obvious. Have you tried this?
According to our experimental results, the effect of DAP is significant. Could you provide more details so that we can reproduce the issue, such as your specific experimental setup and final results?
Running perf.py, I recorded the memory occupancy under different DAP sizes: dap=1 → 61%, dap=2 → 37%, dap=4 → 23%. In my opinion, ideally the memory should halve each time dap goes from 1 to 2 to 4. What is causing the memory drop to be less than that?
I think this result is reasonable. Although DAP slices most of the activations, in practice the theoretical linear reduction cannot be obtained, because the model weights and some parts of the activations are not partitionable.
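For intuition, here is a tiny back-of-the-envelope sketch of that explanation; the 25%/75% split between non-partitionable and DAP-sliced memory is purely illustrative, not a number measured with perf.py.

```python
# Back-of-the-envelope model: only part of the footprint is sharded by DAP.
def estimated_memory(dap_size: int, fixed: float, sharded: float) -> float:
    """Estimated footprint relative to dap=1, given a fixed/sharded split."""
    return fixed + sharded / dap_size

# Illustrative split: ~25% fixed (weights, unpartitioned activations, buffers),
# ~75% sliced by DAP. These fractions are assumptions, not measurements.
fixed, sharded = 0.25, 0.75
for dap in (1, 2, 4):
    print(f"dap={dap}: {estimated_memory(dap, fixed, sharded):.2f} of baseline")
# dap=1: 1.00, dap=2: 0.62, dap=4: 0.44 -- sub-linear, like the numbers above.
```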
Another question: will DAP bring a larger memory drop than using activation checkpointing? Have you done any comparison experiments?
DAP and activation checkpointing are orthogonal techniques and can be used together. Further memory reduction can be obtained by using DAP on top of checkpointing.
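As a rough illustration of how the two combine, here is a minimal sketch using torch.utils.checkpoint; `forward_blocks`, the `blocks` argument, and the two activation tensors are placeholders for an Evoformer-style stack and do not reflect FastFold's actual module interface.

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_blocks(blocks, msa_act: torch.Tensor, pair_act: torch.Tensor):
    """Run a stack of Evoformer-style blocks with activation checkpointing
    on inputs that have already been sliced by DAP along one axial dim."""
    for block in blocks:
        # Checkpointing drops intermediate activations and recomputes them in
        # the backward pass; DAP has already shrunk msa_act / pair_act per
        # rank, so the two savings stack.
        msa_act, pair_act = checkpoint(block, msa_act, pair_act, use_reentrant=False)
    return msa_act, pair_act
```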
I encountered a very strange phenomenon: I applied DAP to OpenFold, but when dap goes from 1 to 2, the memory did not decrease but increased.
What is the length of the amino acid sequence used in your test? As we mentioned in our paper, we recommend using DAP for distributed inference only when the length is greater than 1k, because the communication overhead (in both time and memory) is more significant when the sequence is short.
I just used 256 and 384. Do you mean it's caused by the communication overhead?
Possible reasons why DAP does not work well on short sequences: 1) DAP can only reduce the memory needed for intermediate activations, and when the sequence is not long enough this part of the memory footprint is relatively small. 2) The implementation of DAP requires additional buffers.
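To put some rough numbers on reason 1), here is an illustrative estimate of how a single pair-representation activation grows with sequence length; the channel width (128) and fp32 storage are assumptions for illustration only, and the real Evoformer keeps many such tensors.

```python
def pair_activation_mb(seq_len: int, channels: int = 128, bytes_per_el: int = 4) -> float:
    """Size of one L x L x C fp32 pair activation, in MiB (illustrative)."""
    return seq_len * seq_len * channels * bytes_per_el / 2**20

for L in (256, 384, 1024):
    print(f"L={L}: ~{pair_activation_mb(L):.0f} MiB per pair tensor")
# L=256: ~32 MiB, L=384: ~72 MiB, L=1024: ~512 MiB -- the L^2 term only starts
# to dominate the weights and buffers well past a few hundred residues.
```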
- How should I understand that DAP requires additional buffers? Did your paper mention it?
You can refer, for example, to these lines: https://github.com/hpcaitech/FastFold/blob/main/fastfold/distributed/comm.py#L56-L58
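For readers who don't want to open the link, the following is a hedged sketch of a gather along the DAP dimension; it mirrors the general pattern (allocate an output buffer, all-gather into it) rather than FastFold's exact code.

```python
import torch
import torch.distributed as dist

def gather_along_dim(local: torch.Tensor, dim: int, group=None) -> torch.Tensor:
    """All-gather DAP shards along `dim`; note the extra output buffer."""
    world_size = dist.get_world_size(group)
    # The gathered result needs a buffer world_size times the local shard,
    # which is memory on top of the activations themselves.
    chunks = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(chunks, local.contiguous(), group=group)
    return torch.cat(chunks, dim=dim)
```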
In the paper, scalability is presented as the reason, since it is the more fundamental reason for not using DAP on short sequences.