Wang Chuan
@jwcrawley I tried to build ORB-SLAM with ROS-Jade on Ubuntu 14.10, but ran into a problem reported at https://github.com/raulmur/ORB_SLAM/issues/65. Could you help me work out what might be causing it? Thanks!
Could someone please give the correct expression here? I got stuck at this point. My name isn't important, haha~
Same problem here.
Same issue here. I have a BERT model with a BertSelfFlashAttention class, whose core FA function is `flash_attn_varlen_func`. Now I'm trying to run the code from...
@tridao Could you give me any hints about the following profiling code, which is straightforward to run? The results are ``` 100%|███████████████████████████████████| 10000/10000 [00:03 python 3.9 > cuda 12.4 >...
Yes, after changing to hdim 64, FA3 is now faster than FA2. ``` size: [12288, 12, 64] 100%|███████████████████████████████████| 10000/10000 [00:04
If it isn't too hard, could you give me some hints so that I can adapt your source code (C++/CUDA) to support hdim 32? I can...
Many thanks for the hint. I looked carefully into the code, and I guess adding hdim 32 to `generate_kernels.py` and `flash_api.cpp` may not be that hard; however, for tile_size, I...
@tridao I successfully implemented hdim 32 based on your code; however, the speed is nearly equal to FA2 (though higher than before). ``` size: [12288, 12, 32] 100%|███████████████████████████████████| 10000/10000 [00:03