TransformerEngine
Generation tutorial for Gemma model
Description
I added tutorials for finetuning and for generation with the Gemma model. Moreover, I added a few features that were necessary to make these tutorials work.
Type of change
- [ ] Documentation change (change only to the documentation, either a fix or a new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Changes
- Two new notebooks in the docs: one with finetuning for Gemma (analogous to the Llama tutorial) and one with generation for Gemma,
- Generalized the rotary positional embedding kernel to allow sequences to start at different encoding positions (see the sketch after this list),
- Added a kernel to efficiently save keys and values to the kv_cache,
- Expanded the `InferenceParams` class, which is responsible for caching k and v,
- Changed `DotProductAttention` to run THD attention when there are ragged tensors in the kv_cache.
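For context, here is a minimal pure-PyTorch sketch of what the generalized rotary embedding has to do: rotate each sequence starting from its own absolute position rather than from 0. The function name and the `start_positions` argument are illustrative only, not the actual TE kernel interface.

```python
import torch

def apply_rope_with_offsets(x: torch.Tensor, freqs: torch.Tensor,
                            start_positions: torch.Tensor) -> torch.Tensor:
    """Reference RoPE where sequence b starts at position start_positions[b].

    x:               [b, s, h, d]   query or key tensor
    freqs:           [max_pos, d/2] precomputed rotation angles per position
    start_positions: [b]            first absolute position of each sequence
    """
    b, s, h, d = x.shape
    # Absolute position of every token: start of its sequence + offset within it.
    pos = start_positions[:, None] + torch.arange(s, device=x.device)[None, :]  # [b, s]
    angles = freqs[pos]                          # [b, s, d/2]
    cos = angles.cos()[:, :, None, :]            # broadcast over heads
    sin = angles.sin()[:, :, None, :]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    # Standard rotary transform applied to the two halves of the head dimension.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

During incremental decoding, `start_positions` would be the number of tokens already cached for each sequence, so every newly generated token is rotated with its true absolute position.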
Checklist:
- [x] I have read and followed the contributing guidelines
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
Future work:
`TransformerLayer` does not support thd, and that is a problem. The current workaround works as follows (a usage sketch follows this list):
- one needs to call `setup_before_new_input` before the forward pass to indicate the sequence lengths,
- then one runs the forward pass of `TransformerLayer` with `self_attn_format='thd'` and padded sequences of shape `bshd`,
- all layers get their input in `[b, s, *]` format, not in `[t, *]` (including attention),
- `InferenceParams.retrieve_from_kv_cache` retrieves `key_layer` in `bshd` or `thd` format depending on `inference_params.qkv_format`.
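A hedged sketch of how the workaround above is meant to be driven from a generation loop. The names `setup_before_new_input` and `self_attn_format` are taken from this description; their exact placement and signatures, the `InferenceParams` constructor arguments, and the model sizes are assumptions, not the final API.

```python
# Sketch only: argument names follow this PR's description and may not match
# the merged API exactly.
import torch
import transformer_engine.pytorch as te

hidden_size, ffn_hidden_size, num_heads = 3072, 24576, 16   # Gemma-7B-like sizes
layer = te.TransformerLayer(
    hidden_size, ffn_hidden_size, num_heads,
    self_attn_format="thd",            # name as used in this PR's description
)
inference_params = te.InferenceParams(max_batch_size=4, max_sequence_length=2048)

def decode_step(hidden_states: torch.Tensor, seq_lens: torch.Tensor) -> torch.Tensor:
    """One generation step: hidden_states is padded [b, s, hidden],
    seq_lens holds the true (unpadded) length of each sequence."""
    # 1) Tell the KV cache how long each incoming sequence really is
    #    (hypothetical signature).
    inference_params.setup_before_new_input(lengths=seq_lens)
    # 2) Run the layer on padded bshd input; internally the cache is ragged (thd).
    return layer(hidden_states, inference_params=inference_params)
```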
As can be seen, this is quite a messy workaround. How I think it should be done in the future:
- `TransformerLayer` supports thd and we do not need `setup_before_new_input` at all,
- `InferenceParams` stores the lengths of the cached sequences for each layer,
- for each `TransformerLayer` invocation, the provided sequences are copied to the cache and the lengths of the cached sequences for this layer are updated.
To do this, one will need to remove the `save_to_kv_cache()` kernel and write `save_to_kv_cache_sbhd()` and `save_to_kv_cache_thd()` (no bshd variant, because the cache has shape sbhd for both bshd and sbhd inputs).
The logic for updating sequence lengths needs to be moved from `setup_before_new_input` into `save_to_kv_cache`.
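A minimal pure-PyTorch sketch of what a `save_to_kv_cache_thd`-style copy would have to do under this proposal: scatter packed (thd) new tokens into an sbhd-shaped cache at each sequence's current offset and update the per-layer lengths in the same step. All names and shapes here are illustrative.

```python
import torch

def save_to_kv_cache_thd(new_k: torch.Tensor, new_v: torch.Tensor,
                         k_cache: torch.Tensor, v_cache: torch.Tensor,
                         cu_seqlens: torch.Tensor, cached_lens: torch.Tensor) -> torch.Tensor:
    """Copy ragged (thd) keys/values into an sbhd cache and update lengths.

    new_k, new_v:     [t, h, d]    packed new tokens for all sequences
    k_cache, v_cache: [s, b, h, d] per-layer caches
    cu_seqlens:       [b + 1]      cumulative new-token counts per sequence
    cached_lens:      [b]          tokens already cached per sequence (updated in place)
    """
    batch = cached_lens.shape[0]
    for b in range(batch):  # a real kernel would parallelize this copy
        start, end = cu_seqlens[b].item(), cu_seqlens[b + 1].item()
        n_new = end - start
        offset = cached_lens[b].item()
        k_cache[offset:offset + n_new, b] = new_k[start:end]
        v_cache[offset:offset + n_new, b] = new_v[start:end]
        cached_lens[b] += n_new
    return cached_lens
```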
It is worth noting that we need to take care of backwards compatibility. Right now, generation works only for bshd/sbhd, and one needs to manually update `self.sequence_len_offset`. I think we can write a setter that updates the per-layer statistics whenever `sequence_len_offset` is changed.
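A hedged sketch of that setter idea: a property on `InferenceParams` so that legacy code which writes `sequence_len_offset` directly keeps working while the proposed per-layer length bookkeeping stays in sync. The attribute names (`cached_seq_lens`) and the way the scalar is propagated are assumptions for illustration.

```python
import torch

class InferenceParams:  # illustrative subset, not the actual TE class
    def __init__(self, max_batch_size: int, max_sequence_length: int):
        self.max_batch_size = max_batch_size
        self.max_sequence_length = max_sequence_length
        # Hypothetical per-layer cached-length bookkeeping proposed above.
        self.cached_seq_lens = {}  # layer_number -> LongTensor[b]
        self._sequence_len_offset = 0

    @property
    def sequence_len_offset(self) -> int:
        return self._sequence_len_offset

    @sequence_len_offset.setter
    def sequence_len_offset(self, value: int) -> None:
        # Legacy bshd/sbhd generation updates this scalar manually; propagate
        # the change to the per-layer length statistics so both paths agree.
        self._sequence_len_offset = value
        for lens in self.cached_seq_lens.values():
            lens.fill_(value)
```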
If thd support in `TransformerLayer` is not added in the near future, I propose writing the sequence lengths into `inference_params.cu_seqlens`; note that this is a beta solution (in the future, `cu_seqlens` will probably be added as an argument to `TransformerLayer`). Then `TransformerLayer` is used with bshd. If `MultiHeadAttention` gets `inference_params.cu_seqlens != None`, it converts the padded bshd input into thd, calls `save_to_kv_cache`, etc., runs `DotProductAttention` with thd, and then converts the output back to bshd.
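If that fallback path is taken, the padded-bshd to thd conversion inside `MultiHeadAttention` could look roughly like this pure-PyTorch sketch, assuming `cu_seqlens` follows the usual cumulative-lengths convention; the helper names are hypothetical.

```python
import torch

def bshd_padded_to_thd(x: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Pack a padded [b, s, h, d] tensor into [t, h, d] using cumulative lengths."""
    b, s = x.shape[:2]
    seq_lens = cu_seqlens[1:] - cu_seqlens[:-1]                            # [b]
    mask = torch.arange(s, device=x.device)[None, :] < seq_lens[:, None]   # [b, s]
    return x[mask]                                                         # [t, h, d]

def thd_to_bshd_padded(x_thd: torch.Tensor, cu_seqlens: torch.Tensor,
                       batch: int, max_seqlen: int) -> torch.Tensor:
    """Scatter a packed [t, h, d] tensor back into padded [b, s, h, d]."""
    seq_lens = cu_seqlens[1:] - cu_seqlens[:-1]
    out = x_thd.new_zeros(batch, max_seqlen, *x_thd.shape[1:])
    mask = torch.arange(max_seqlen, device=x_thd.device)[None, :] < seq_lens[:, None]
    out[mask] = x_thd
    return out
```

The two helpers are exact inverses for valid (non-padding) tokens, so the attention output can be round-tripped back into the padded layout the rest of the layer expects.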
/te-ci pytorch
@cyanguwa, I need your help in reviewing the attention-related files (attention.py, test_fused_rope.py, etc.)