llm-foundry
LLM training code for Databricks foundation models
There are 4 TODOs regarding compiled flex attention that need to be investigated before checking in. See the tests for more details. TL;DR: - I think sequence lengths which are...
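For context, a minimal sketch of what "compiled flex attention" refers to in PyTorch (illustrative only, not the llm-foundry test code; shapes, dtype, and device are assumptions, and it requires PyTorch ≥ 2.5 with a GPU):

```python
# Illustrative sketch: compiling PyTorch's flex_attention and running it on an
# arbitrary sequence length. Not llm-foundry code; shapes/device are assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention

compiled_flex_attention = torch.compile(flex_attention)

# (batch, heads, seq_len, head_dim) -- seq_len is the dimension the TODOs are about.
q = torch.randn(2, 8, 256, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = compiled_flex_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 256, 64])
```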
## 🚀 Feature Request

TransformerEngine has advanced attention kernels, including support for FlashAttention-3 and low-precision kernels.

## Motivation

Having TransformerEngine's attention as an `attn_impl` option would be super nice due...
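A hypothetical sketch of the requested option, dispatching on `attn_impl` in a model's attention config. The `'te'` value and the builder below are assumptions illustrating the request, not existing llm-foundry code; `'flash'` and `'torch'` mirror the options that exist today:

```python
# Sketch only: how a TransformerEngine-backed attn_impl could be selected.
attn_config = {
    'attn_impl': 'te',   # proposed TransformerEngine backend (does not exist today)
    'attn_pdrop': 0.0,
}

def build_attention(attn_config: dict):
    impl = attn_config['attn_impl']
    if impl == 'te':
        # Would construct TransformerEngine's fused attention here, e.g.
        # transformer_engine.pytorch.DotProductAttention, to get FA-3 /
        # low-precision kernels. Left unimplemented since this is the proposal.
        raise NotImplementedError('TransformerEngine backend is the proposed feature')
    elif impl in ('flash', 'torch'):
        ...  # existing llm-foundry implementations
    else:
        raise ValueError(f'Unknown attn_impl: {impl}')
```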
Datasets throws this error: https://github.com/huggingface/datasets/blob/661d7bac29689e2d9eb74fba3d243939d6e9f25b/src/datasets/splits.py#L362 when a split doesn't match the regex. We catch this and re-raise it to the user.
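A minimal sketch of that catch-and-re-raise pattern, assuming the underlying check is the split-name validation at the linked line (the wrapper name and message are illustrative, not the exact llm-foundry code):

```python
# Catch datasets' split-name validation error and surface a clearer message.
from datasets import load_dataset

def load_split_or_explain(path: str, split: str):
    try:
        return load_dataset(path, split=split)
    except ValueError as e:
        # datasets raises ValueError when the split string fails its regex check.
        raise ValueError(
            f'Split name {split!r} was rejected by the datasets library. '
            'Check that the split exists and contains only valid characters.'
        ) from e
```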
## Manual Test

`test-log-model-no-save-hNNfeX`
https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/482477677751793/runs/ea44b38569974edf8573b3f66558a15f?o=7395834863327820

Two models are logged, one per save batch. The model from the last batch (ba10) is registered; the ba5 model is only logged.
With streaming upgraded to 0.9.1, the unit test runs into an infinite loop.
When I set moe_loss_weight: 0:

```
[rank7]:   File "/home/syx/miniconda3/envs/lmf/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2907, in
[rank7]:     **kwargs: self._train_microbatches(microbatches, loss_dict, **kwargs).item(),
[rank7]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/syx/miniconda3/envs/lmf/lib/python3.11/site-packages/composer/trainer/trainer.py", line 3075, in _train_microbatches
[rank7]:     microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size,...
```