Wang Chuan comments

Results: 10 comments by Wang Chuan

How can I run this on ROS (Jade)?

@jwcrawley I tried to build ORB-SLAM with ROS Jade on Ubuntu 14.10, but ran into the problem reported at https://github.com/raulmur/ORB_SLAM/issues/65. Could you help me work out what may be causing it? Thanks!

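The first formula in the algorithm derivation of Section 4.3.1

Could you give the correct expression here? I got stuck at this point while reading. My name isn't important, haha~~~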

ValueError: Cell is empty

same problem here

Example code fails with google.protobuf.message.DecodeError: Truncated message

Same here.

Applying FA3 in qwen2 model fine-tuning is slower than FA2

Same issue here. I have a BERT model with a BertSelfFlashAttention class, whose core FA call is `flash_attn_varlen_func`. Now I'm trying to run the code from...

Applying FA3 in qwen2 model fine-tuning is slower than FA2

@tridao Could you give me any hints about the following profiling code, which is straightforward to run? The results are ``` 100%|███████████████████████████████████| 10000/10000 [00:03 python 3.9 > cuda 12.4 >...

Applying FA3 in qwen2 model fine-tuning is slower than FA2

Yes, after changing to hdim 64, FA3 is now faster than FA2. ``` size: [12288, 12, 64] 100%|███████████████████████████████████| 10000/10000 [00:04
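For anyone reproducing this kind of FA2 vs FA3 comparison, a minimal timing harness might look like the sketch below. It is not the profiling code from this thread: `benchmark` and `stand_in_workload` are hypothetical names, and the workload is a plain-Python placeholder so the snippet runs without a GPU. When timing CUDA kernels, the assumption is that you pass `sync=torch.cuda.synchronize`, since kernel launches are asynchronous and the timer would otherwise stop before the work finishes.

```python
import time
from statistics import median


def benchmark(fn, *args, warmup=10, iters=100, sync=None):
    """Return the median wall-clock time per call of fn(*args), in seconds.

    `sync` is an optional zero-argument callable invoked after the warm-up
    and after every timed call (e.g. torch.cuda.synchronize for GPU code).
    """
    # Warm-up calls are not timed; they absorb one-off costs such as
    # lazy initialization or kernel compilation.
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the timer
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional slow outliers than the mean.
    return median(samples)


def stand_in_workload(n):
    # Hypothetical placeholder for the attention call being measured.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    t = benchmark(stand_in_workload, 1000, warmup=5, iters=50)
    print(f"median time per call: {t * 1e6:.1f} us")
```

Running the same harness once per implementation, with identical inputs and head dimension, keeps the two measurements comparable.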

Applying FA3 in qwen2 model fine-tuning is slower than FA2

If it is not hard, is it possible for you to give me some hints so that I can adapt your source code (C++/CUDA) to support hdim 32? I can...

Applying FA3 in qwen2 model fine-tuning is slower than FA2

Really thankful for the hint. I looked carefully into the code, and I guess `generate_kernels.py` and `flash_api.cpp` may not be that hard to extend to hdim 32; however, for tile_size, I...

Applying FA3 in qwen2 model fine-tuning is slower than FA2

@tridao I successfully implemented hdim32 based on your code; however, the speed is nearly equal to FA2 (higher than before). ``` size: [12288, 12, 32] 100%|███████████████████████████████████| 10000/10000 [00:03