Data problem when using DAP
I'm confused about DAP:
- Can the parameter `dap_size` only take the value 2, meaning row and column?
- Is the input data complete, or do I need to divide the data by `dap_size` as input? Thanks.
Thank you for your question.
- `dap_size` refers to how many devices are used for Dynamic Axial Parallelism. It can be set to the number of GPUs used in distributed inference.
- I think you need to divide the data, for example: https://github.com/hpcaitech/FastFold/blob/main/fastfold/utils/inject_openfold.py#L46-L47 (a rough sketch of the split is shown below).
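Roughly, the split looks like the following sketch (`split_for_dap` and `dap_rank` are illustrative names, not FastFold's exact API): each of the `dap_size` ranks keeps only its own slice of one axis of the input.

```python
import torch

# Rough sketch (hypothetical helper, not FastFold's exact code): with Dynamic
# Axial Parallelism each of the dap_size GPUs holds one slice of an axis of
# the activation, so the full tensor is split before entering the Evoformer
# and each rank keeps only its own shard.
def split_for_dap(x: torch.Tensor, dap_size: int, dap_rank: int, dim: int) -> torch.Tensor:
    chunks = torch.chunk(x, dap_size, dim=dim)  # dap_size roughly equal slices
    return chunks[dap_rank].contiguous()        # this rank's local shard

# Example: an MSA-like activation of shape (num_seqs, num_res, channels)
msa = torch.randn(128, 256, 64)
local = split_for_dap(msa, dap_size=2, dap_rank=0, dim=1)
print(local.shape)  # torch.Size([128, 128, 64])
```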
Thank you for your reply. Another question: at line 161 of fastfold/model/ops.py I get:
```
einops.EinopsError: Error while processing rearrange-reduction pattern "b1 b2 n (h d) -> b1 b2 h n d".
Input tensor shape: torch.Size([1, 2, 132, 258, 256]). Additional info: {'h': 8}.
Expected 4 dimensions, got 5
```
I suspect it is caused by the previous unsqueeze operation.
To align with OpenFold inference, where there is no batch dimension, we use unsqueeze. If your data enters the Evoformer with a batch dimension, you can simply remove the unsqueeze/squeeze operations.
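As a small illustration with the shapes from your error message (a plain einops call rather than the FastFold code path): the pattern expects 4 dimensions, so squeezing away the extra batch dimension makes it line up.

```python
import torch
from einops import rearrange

# Minimal illustration of the dimension mismatch above: the pattern
# 'b1 b2 n (h d)' expects a 4-D tensor, so an extra unsqueeze-d batch
# dimension makes the input 5-D and the rearrange fails.
x = torch.randn(1, 2, 132, 258, 256)   # 5-D: extra leading batch dim
x4 = x.squeeze(0)                      # drop it to get the expected 4-D shape
out = rearrange(x4, 'b1 b2 n (h d) -> b1 b2 h n d', h=8)
print(out.shape)                       # torch.Size([2, 132, 8, 258, 32])
```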
OK, thanks. Two more questions:
1. Are you using an A100 40GB or an A100 80GB? I don't see the relevant information in the paper.
2. At line 34 of fastfold/model/kernel/cuda_native/layer_norm.py I get:
```
fastfold_layer_norm_cuda.backward_affine()
RuntimeError: expected scalar type Half but found Float
```
Regarding precision, what should I pay attention to?
- We use two platforms in the paper: a supercomputer (A100 40 GB) for the training experiments and a GPU server (A100 80 GB) for the inference experiments.
- The precision mismatch happens in the backward pass. It is likely that your training uses, for example, a mixed-precision method that implicitly changes the precision of the parameters. We may need more information to determine the cause of this error.
By the way, FastFold currently only supports float32 and bfloat16.
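A minimal dtype sketch, using a plain torch.nn.LayerNorm as a stand-in for FastFold's fused kernel: keep the module's parameters and its inputs in the same supported dtype (float32 or bfloat16), since mixing float16 into the graph is what typically produces the "expected scalar type Half but found Float" error above.

```python
import torch

# Minimal sketch (plain torch.nn.LayerNorm stands in for the fused kernel):
# keep parameters and inputs in one supported dtype, float32 or bfloat16,
# rather than letting float16 autocast change one of them.
layer = torch.nn.LayerNorm(256, dtype=torch.bfloat16)
x = torch.randn(1, 132, 256, dtype=torch.bfloat16)
y = layer(x)
print(y.dtype)  # torch.bfloat16
```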
Using fp32 solved the previous problem, but there is still an error in the backward pass:
File "/opt/conda/lib/python3.8/site-packages/fastfold-0.1.0b0-py3.8-linux-x86_64.egg/fastfold/model/kernel/cuda_native/softmax.py", line 88, in backward
grad_bias = torch.sum(grad_input, dim=1, keepdim=True)
RuntimeError: CUDA error: an illegal memory access was encountered
We can't reproduce the error. Could you provide example code and information about your hardware platform so that we can help you solve the problem?
Hello, I find the error may be caused by the softmax. I am confused about the following code:
```
if nonbatched_bias is not None:
    # logits += nonbatched_bias.unsqueeze(1)
    bias = gather_async_opp(*nonbatched_bias, dim=1)
    bias = rearrange(bias, 'b q k h -> b h q k')
    weights = scale_mask_bias_softmax(logits, mask, bias.unsqueeze(1), self.scaling)
else:
    weights = scale_mask_softmax(logits, mask, self.scaling)
```
In my opinion, it is used to replace `softmax(logits * self.scaling + bias)`. What is the mask's role, and how much of a performance increase can this bring?
In some scenarios, such as single-sequence inference, the mask is not necessary. But in scenarios where padding is used, the mask is necessary.
If you have problems with this kernel, you can just use torch's native API instead, such as `softmax(logits * self.scaling + bias)`. This will cause some performance degradation, but it may not be a big problem.
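A rough native-PyTorch equivalent of the two fused kernels, assuming the mask is 1 for valid positions and 0 for padding and that mask/bias broadcast against logits (the fused CUDA kernel performs these steps in a single pass):

```python
import torch

# Rough reference versions of the fused kernels (assumed semantics: mask is 1
# for valid positions, 0 for padding; mask and bias broadcast against logits).
def scale_mask_bias_softmax_ref(logits, mask, bias, scaling):
    mask_bias = (1.0 - mask) * -1e9  # push padded keys toward -inf
    return torch.softmax(logits * scaling + mask_bias + bias, dim=-1)

def scale_mask_softmax_ref(logits, mask, scaling):
    mask_bias = (1.0 - mask) * -1e9
    return torch.softmax(logits * scaling + mask_bias, dim=-1)

# Example shapes: (batch, heads, query, key) logits with a per-key mask
logits = torch.randn(1, 8, 132, 132)
mask = torch.ones(1, 1, 1, 132)
bias = torch.randn(1, 8, 132, 132)
weights = scale_mask_bias_softmax_ref(logits, mask, bias, scaling=1 / 16)
```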