mamba
Add documentation/tests on how to use inference_params with Mamba to generate sequences by parts
Right now there are no tests or docs on how to continue generation from a given state.
I think I figured it out: at least generating by parts and generating the whole sequence at once produce the same output. I'm not 100% sure, though: the seqlen_offset value itself is not used by the Mamba class, only whether it is > 0, and max_batch_size is not even mentioned in the class. So if the test is not completely bogus, I can wrap it in pytest and open a PR.
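Here's a minimal sketch of the test I have in mind, assuming mamba_ssm's Mamba block and InferenceParams as they exist today (the sizes, prefix_len, and tolerances below are just for illustration, and a CUDA device is assumed):

```python
import torch
from mamba_ssm import Mamba
from mamba_ssm.utils.generation import InferenceParams

torch.manual_seed(0)
batch, seqlen, d_model, prefix_len = 2, 16, 64, 8

# layer_idx is needed so the block can find its slot in key_value_memory_dict
model = Mamba(d_model=d_model, layer_idx=0).cuda()
x = torch.randn(batch, seqlen, d_model, device="cuda")

# Reference: the whole sequence in one forward pass, no cache
ref = model(x)

# Part 1: run a prefix with seqlen_offset == 0; this fills the conv/ssm state cache
params = InferenceParams(max_seqlen=seqlen, max_batch_size=batch)
out_parts = [model(x[:, :prefix_len], inference_params=params)]

# Part 2: continue one token at a time; seqlen_offset > 0 makes the block take the step() path
params.seqlen_offset = prefix_len
for t in range(prefix_len, seqlen):
    out_parts.append(model(x[:, t : t + 1], inference_params=params))
    params.seqlen_offset += 1

# Generating by parts should match generating the whole sequence at once
assert torch.allclose(ref, torch.cat(out_parts, dim=1), atol=1e-3, rtol=1e-3)
```

The tolerances may need tuning, since the step path and the parallel scan use different kernels.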
Can you explain the use case here? Would this be like: if the model is handling topic A, we're using and updating state A for each inference?
Yes, manual cache handling over a long sequence of text and over a long period of time.
E.g. "Chapter N: previous context goes here. State at this point should be cached for days and stored to disk to not reparse things.\n\n Chapter N+1: (lots of long text being rewritten)"
Just to be sure: is this a different use case from using cg=True with generate? I mean generate as defined with DecodingCGCache and capture_graph in https://github.com/state-spaces/mamba/blob/main/mamba_ssm/utils/generation.py
I'm going to try this out on a handbook or 2 and see how it does.
Similar. It's about manual control over every aspect of the cache (and hence the state) for the model. The model itself uses InferenceParams.