LLMRoofline

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

2 LLMRoofline issues

Could someone explain this? According to FlashAttention's theoretical IO complexity analysis, memory access is O(N²d²/M), which should be very low. But the code below computes a very large memory access, and it uses T_r rather than T_c, which does not match FlashAttention's theoretical analysis. How should I understand this calculation?

```python3
if use_flashattention:
    name = f"fused_attention"
    bandwidth, max_OPS, onchip_buffer = self.get_hardware_info()
    # flashattention-2 https://arxiv.org/pdf/2307.08691.pdf
    block_size_r = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)
    n_blocks_r = math.ceil(seqlen / block_size_r)
    ...
```
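For context on the theoretical bound the question refers to: the following is a hypothetical sketch (not code from this repository) that plugs example values into FlashAttention's O(N²d²/M) HBM-access bound and compares it against naive attention, which materializes the full N×N score matrix. The values of N, d, and M are illustrative assumptions.

```python3
import math

# Illustrative values (assumptions, not from LLMRoofline):
N = 4096        # sequence length
d = 128         # head dimension
M = 128 * 1024  # on-chip SRAM capacity, in elements

# FlashAttention's theoretical HBM access: O(N^2 * d^2 / M) elements.
flash_io = N * N * d * d / M

# Naive attention reads/writes the N x N score matrix plus the N x d inputs.
naive_io = N * N + N * d

print(f"FlashAttention IO: {flash_io:.3e} elements")
print(f"Naive IO:          {naive_io:.3e} elements")
print(f"Ratio (flash/naive): {flash_io / naive_io:.2f}")
```

For large N the ratio approaches d²/M, so whenever the SRAM can hold more than d² elements the tiled algorithm moves strictly less data than the naive one; if a roofline script reports a much larger number, it is worth checking whether it is counting per-block loads of K/V across all T_r row blocks rather than the deduplicated theoretical bound.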