Does it support batch inference?

Open SkyAndCloud opened this issue 1 year ago • 1 comment

Hi guys, thank you for this excellent work! It seems that this code does not take the attention_mask into account during ChunkLlama inference. Does this code support batch inference?

May 23 '24 09:05 SkyAndCloud

Yes! We use flash_attn_func for better efficiency and simplicity. Changing to flash_attn_varlen_func should not be difficult. If you encounter any difficulties, please feel free to leave a comment.
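For anyone attempting the switch, here is a minimal sketch of the bookkeeping that flash_attn_varlen_func needs: the padded batch is packed into one token stream, and the function is told where each sequence starts and ends through cumulative sequence lengths. The helper name `varlen_metadata` is hypothetical and not part of this repository or of flash-attn. It uses plain Python lists for clarity; in practice the same values would be int32 torch tensors on the GPU.

```python
from itertools import accumulate


def varlen_metadata(attention_mask):
    """Derive packing metadata from a padded attention mask.

    attention_mask: list of rows, each a list of 0/1 (1 = real token).
    Returns (indices, cu_seqlens, max_seqlen):
      indices    - positions of real tokens in the flattened
                   (batch * seq_len) layout, used to drop padding
      cu_seqlens - cumulative sequence lengths, length batch + 1
      max_seqlen - length of the longest sequence in the batch
    Hypothetical helper for illustration only.
    """
    seq_len = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]
    cu_seqlens = [0] + list(accumulate(seqlens))
    max_seqlen = max(seqlens)
    indices = [
        b * seq_len + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row)
        if keep
    ]
    return indices, cu_seqlens, max_seqlen


# Two sequences of length 3 and 2, right-padded to length 4.
mask = [[1, 1, 1, 0],
        [1, 1, 0, 0]]
indices, cu_seqlens, max_seqlen = varlen_metadata(mask)
print(indices)     # [0, 1, 2, 4, 5]
print(cu_seqlens)  # [0, 3, 5]
print(max_seqlen)  # 3

# With q, k, v gathered at `indices` into shape
# (total_tokens, nheads, headdim), the call would look roughly like:
#   flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
#                          max_seqlen, max_seqlen, causal=True)
# The output is also packed and has to be scattered back to the
# padded (batch, seq_len, ...) layout using the same indices.
```

Because the indices come straight from the mask, the same helper works for left-padded batches, which are common for generation. How this interacts with the chunked attention in this repository is not covered by the sketch and would need to be checked against the actual code.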

May 23 '24 13:05 ChenxinAn-fdu