Data problem when using DAP
I'm confused about DAP:
- Can the parameter `dap_size` only take the value 2, meaning row and column?
- Is the input data complete, or do I need to divide the data by `dap_size` as input? Thanks.
Thank you for your question.
- `dap_size` refers to how many devices are used for Dynamic Axial Parallelism. It can be set to the number of GPUs used in distributed inference.
- I think you need to divide the data, for example: https://github.com/hpcaitech/FastFold/blob/main/fastfold/utils/inject_openfold.py#L46-L47 (a rough sketch of the split is shown below).
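Roughly, the split looks like the following sketch (`split_for_dap` and `dap_rank` are illustrative names, not FastFold's exact API): each of the `dap_size` ranks keeps only its own slice of one axis of the input.

```python
import torch

# Rough sketch (hypothetical helper, not FastFold's exact code): with Dynamic
# Axial Parallelism each of the dap_size GPUs holds one slice of an axis of
# the activation, so the full tensor is split before entering the Evoformer
# and each rank keeps only its own shard.
def split_for_dap(x: torch.Tensor, dap_size: int, dap_rank: int, dim: int) -> torch.Tensor:
    chunks = torch.chunk(x, dap_size, dim=dim)  # dap_size roughly equal slices
    return chunks[dap_rank].contiguous()        # this rank's local shard

# Example: an MSA-like activation of shape (num_seqs, num_res, channels)
msa = torch.randn(128, 256, 64)
local = split_for_dap(msa, dap_size=2, dap_rank=0, dim=1)
print(local.shape)  # torch.Size([128, 128, 64])
```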
Thank you for your reply. Another question: at line 161 of fastfold/model/ops.py I get:
```
einops.EinopsError: Error while processing rearrange-reduction pattern "b1 b2 n (h d) -> b1 b2 h n d".
Input tensor shape: torch.Size([1, 2, 132, 258, 256]). Additional info: {'h': 8}.
Expected 4 dimensions, got 5
```
I suspect it is caused by the previous unsqueeze operation.
To align with OpenFold inference, where there is no batch dimension, we use unsqueeze. If your data enters the Evoformer with a batch dimension, you can simply remove the unsqueeze/squeeze operations.
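As a small illustration with the shapes from your error message (a plain einops call rather than the FastFold code path): the pattern expects 4 dimensions, so squeezing away the extra batch dimension makes it line up.

```python
import torch
from einops import rearrange

# Minimal illustration of the dimension mismatch above: the pattern
# 'b1 b2 n (h d)' expects a 4-D tensor, so an extra unsqueeze-d batch
# dimension makes the input 5-D and the rearrange fails.
x = torch.randn(1, 2, 132, 258, 256)   # 5-D: extra leading batch dim
x4 = x.squeeze(0)                      # drop it to get the expected 4-D shape
out = rearrange(x4, 'b1 b2 n (h d) -> b1 b2 h n d', h=8)
print(out.shape)                       # torch.Size([2, 132, 8, 258, 32])
```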
OK, thanks. Two more questions:
1. Are you using an A100 40GB or an A100 80GB? I don't see the relevant information in the paper.
2. At line 34 of fastfold/model/kernel/cuda_native/layer_norm.py I get:
```
fastfold_layer_norm_cuda.backward_affine()
RuntimeError: expected scalar type Half but found Float
```
Regarding precision, what should I pay attention to?
- We use two platforms in the paper: a supercomputer (A100 40 GB) for the training experiments and a GPU server (A100 80 GB) for the inference experiments.
- The precision mismatch happens in the backward pass. It is likely that your training uses, for example, a mixed-precision method that implicitly changes the precision of the parameters. We may need more information to determine the cause of this error.
By the way, FastFold currently only supports float32 and bfloat16.
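A minimal dtype sketch, using a plain torch.nn.LayerNorm as a stand-in for FastFold's fused kernel: keep the module's parameters and its inputs in the same supported dtype (float32 or bfloat16), since mixing float16 into the graph is what typically produces the "expected scalar type Half but found Float" error above.

```python
import torch

# Minimal sketch (plain torch.nn.LayerNorm stands in for the fused kernel):
# keep parameters and inputs in one supported dtype, float32 or bfloat16,
# rather than letting float16 autocast change one of them.
layer = torch.nn.LayerNorm(256, dtype=torch.bfloat16)
x = torch.randn(1, 132, 256, dtype=torch.bfloat16)
y = layer(x)
print(y.dtype)  # torch.bfloat16
```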
Using fp32 solved the previous problem, but there is still an error in the backward pass:
File "/opt/conda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/kernel/cuda_native/softmax.py", line 88, in backward
grad_bias = torch.sum(grad_input, dim=1, keepdim=True)
RuntimeError: CUDA error: an illegal memory access was encountered
We can't reproduce the error. Could you provide example code and information about your hardware platform so that we can help you solve the problem?
Hello, I find the error may be caused by the softmax. I am confused about the following code:
```
if nonbatched_bias is not None:
    # logits += nonbatched_bias.unsqueeze(1)
    bias = gather_async_opp(*nonbatched_bias, dim=1)
    bias = rearrange(bias, 'b q k h -> b h q k')
    weights = scale_mask_bias_softmax(logits, mask, bias.unsqueeze(1), self.scaling)
else:
    weights = scale_mask_softmax(logits, mask, self.scaling)
```
In my opinion, it is used to replace `softmax(logits * self.scaling + bias)`. What is the mask's role, and how much of a performance increase can this bring?
In some scenarios, such as single-sequence inference, the mask is not necessary. But in scenarios where padding is used, the mask is necessary.
If you have problems with this kernel, you can just use torch's native API instead, such as `softmax(logits * self.scaling + bias)`. This will cause some performance degradation, but it may not be a big problem.
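A rough native-PyTorch equivalent of the two fused kernels, assuming the mask is 1 for valid positions and 0 for padding and that mask/bias broadcast against logits (the fused CUDA kernel performs these steps in a single pass):

```python
import torch

# Rough reference versions of the fused kernels (assumed semantics: mask is 1
# for valid positions, 0 for padding; mask and bias broadcast against logits).
def scale_mask_bias_softmax_ref(logits, mask, bias, scaling):
    mask_bias = (1.0 - mask) * -1e9  # push padded keys toward -inf
    return torch.softmax(logits * scaling + mask_bias + bias, dim=-1)

def scale_mask_softmax_ref(logits, mask, scaling):
    mask_bias = (1.0 - mask) * -1e9
    return torch.softmax(logits * scaling + mask_bias, dim=-1)

# Example shapes: (batch, heads, query, key) logits with a per-key mask
logits = torch.randn(1, 8, 132, 132)
mask = torch.ones(1, 1, 1, 132)
bias = torch.randn(1, 8, 132, 132)
weights = scale_mask_bias_softmax_ref(logits, mask, bias, scaling=1 / 16)
```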