[WIP] Add small, large, 3b, 11b pre-trained weights for t5
Description
Add t5 bundlers for the small, large, 3b, and 11b configurations.
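For context, a minimal sketch of how the new bundlers would be used; the bundler name `T5_SMALL` is an assumption following the existing `T5_BASE` naming convention and may differ:

```python
from torchtext.prototype.models import T5_SMALL  # hypothetical bundler name

# get_model() builds the architecture from the bundler's T5Conf and loads
# the pre-trained weights from the checkpoint its _path attribute points to.
model = T5_SMALL.get_model()
model.eval()
```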
Process
- Upload pre-trained weights and create bundler objects for each configuration. The `_path` attributes point to the corresponding checkpoints uploaded to our bucket.
- The base model projects the query, keys, and values to dimension `embed_dim / num_heads`. The 3b and 11b models break from this convention and specify the dimensions these tensors get projected to. Therefore, we introduce a new parameter `qkv_dim` in `T5Conf` and in all of our t5 modules so that this projection dimension can be taken into account. To do this, we also introduce a new method, `T5MultiheadAttention._t5_in_projection` (see the sketch after this list). It is a modified version of `torch.nn.functional._in_projection`. This change was necessary because `torch.nn.functional._in_projection` expects the query, keys, and values to all get projected to the dimension `embed_dim / num_heads`, an assumption that no longer holds for the 3b and 11b configurations.
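A minimal sketch of the generalized in-projection, assuming the simplified signature below (the actual method on `T5MultiheadAttention` may take additional arguments); the 3b shapes in the comment are illustrative:

```python
import torch.nn.functional as F
from torch import Tensor
from typing import Optional, Tuple

def _t5_in_projection(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w_q: Tensor,  # (qkv_dim, embed_dim)
    w_k: Tensor,  # (qkv_dim, embed_dim)
    w_v: Tensor,  # (qkv_dim, embed_dim)
    b_q: Optional[Tensor] = None,
    b_k: Optional[Tensor] = None,
    b_v: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor, Tensor]:
    # torch.nn.functional._in_projection asserts the projection weights are
    # square, i.e. that q/k/v are projected back to embed_dim. Here the
    # weights may instead map embed_dim -> qkv_dim, e.g. 1024 -> 4096 for 3b,
    # where num_heads * head_dim no longer equals embed_dim.
    return F.linear(q, w_q, b_q), F.linear(k, w_k, b_k), F.linear(v, w_v, b_v)
```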
Testing
Add integration tests for small and large. The 3b and 11b checkpoints are large (12GB and 47GB) and each takes a long time to load (see timings below). To keep the CI tests time-efficient, we have not checked in their unit tests, though we have run them locally to ensure they perform as expected.
`pytest test/prototype/integration_tests/test_models.py`

- small testing takes 24.67s
- base testing takes 74.72s (0:01:14)
- large testing takes 251.09s (0:04:11)
- 3b testing takes 1034.32s (0:17:14)
- 11b testing takes 0s (0:00:00) (TBD)
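Although the 3b and 11b tests are not checked in, a pattern like the following could keep them runnable locally without slowing CI; the environment flag and `T5_3B` bundler name are assumptions for illustration, not the actual test code:

```python
import os
import pytest

RUN_LARGE = os.environ.get("RUN_LARGE_T5_TESTS") == "1"  # hypothetical flag

@pytest.mark.skipif(not RUN_LARGE, reason="3b/11b checkpoints (12GB/47GB) are too heavy for CI")
def test_t5_3b_model():
    from torchtext.prototype.models import T5_3B  # hypothetical bundler name

    model = T5_3B.get_model()
    model.eval()
    # ...compare model outputs against expected values, as in the checked-in
    # small/large integration tests.
```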