LLMRoofline

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

2 LLMRoofline issues

Could someone explain this? According to FlashAttention's theoretical IO complexity analysis, memory access is O(N²d²/M), which should be very low. But the code below computes a very large memory access, and it uses T_r rather than T_c, which does not match FlashAttention's theoretical analysis. How should I understand this calculation?

```python3
if use_flashattention:
    name = f"fused_attention"
    bandwidth, max_OPS, onchip_buffer = self.get_hardware_info()
    # flashattention-2 https://arxiv.org/pdf/2307.08691.pdf
    block_size_r = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)
    n_blocks_r = math.ceil(seqlen / block_size_r)
    ...
```
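For context on the theoretical bound the question refers to: the following is a hypothetical sketch (not code from this repository) that plugs example values into FlashAttention's O(N²d²/M) HBM-access bound and compares it against naive attention, which materializes the full N×N score matrix. The values of N, d, and M are illustrative assumptions.

```python3
import math

# Illustrative values (assumptions, not from LLMRoofline):
N = 4096        # sequence length
d = 128         # head dimension
M = 128 * 1024  # on-chip SRAM capacity, in elements

# FlashAttention's theoretical HBM access: O(N^2 * d^2 / M) elements.
flash_io = N * N * d * d / M

# Naive attention reads/writes the N x N score matrix plus the N x d inputs.
naive_io = N * N + N * d

print(f"FlashAttention IO: {flash_io:.3e} elements")
print(f"Naive IO:          {naive_io:.3e} elements")
print(f"Ratio (flash/naive): {flash_io / naive_io:.2f}")
```

For large N the ratio approaches d²/M, so whenever the SRAM can hold more than d² elements the tiled algorithm moves strictly less data than the naive one; if a roofline script reports a much larger number, it is worth checking whether it is counting per-block loads of K/V across all T_r row blocks rather than the deduplicated theoretical bound.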