ChunkLlama
Does it support batch inference?
Hi guys, thank you for this excellent work! It seems that this code does not take the attention_mask into account during ChunkLlama inference. Does this code support batch inference?
Yes! We use flash_attn_func for better efficiency and simplicity, so the attention_mask is not applied directly; switching to flash_attn_varlen_func should not be difficult. If you encounter any difficulties, please feel free to leave a comment.
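For anyone attempting the switch, here is a minimal sketch (not the repository's actual code) of how a padded batch with an attention_mask could be routed through flash_attn_varlen_func: the padded positions are dropped, cumulative sequence lengths are built from the mask, and the output is scattered back into the padded layout. The helper name `varlen_attention` and the assumed tensor layout `(batch, seqlen, num_heads, head_dim)` are illustrative assumptions; flash-attn's own `bert_padding` utilities could also be used for the unpad/pad steps.

```python
import torch
from flash_attn import flash_attn_varlen_func


def varlen_attention(q, k, v, attention_mask, softmax_scale=None):
    """Sketch: q, k, v are (batch, seqlen, num_heads, head_dim);
    attention_mask is (batch, seqlen) with 1 for real tokens, 0 for padding."""
    # Number of real (non-padded) tokens per sequence.
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)                       # (batch,)
    # Cumulative sequence lengths, prefixed with 0, as expected by flash_attn_varlen_func.
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())

    # Pack to (total_tokens, num_heads, head_dim), dropping padded positions.
    mask = attention_mask.bool()
    q_unpad, k_unpad, v_unpad = q[mask], k[mask], v[mask]

    out_unpad = flash_attn_varlen_func(
        q_unpad, k_unpad, v_unpad,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        softmax_scale=softmax_scale,
        causal=True,
    )

    # Scatter the packed output back into the padded (batch, seqlen, ...) layout.
    out = torch.zeros_like(q)
    out[mask] = out_unpad
    return out
```

This is only meant to illustrate the shape bookkeeping; how it plugs into ChunkLlama's chunked attention (e.g. where the chunk-wise position handling happens) would still follow the repository's existing code path.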