
LLM training code for Databricks foundation models

267 llm-foundry issues, sorted by recently updated

There are 4 TODOs regarding compiled flex attention that needed to be investigated before checking in. See the tests for more details. TL;DR: - I think sequence lengths which are...
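For context on the kind of check those TODOs refer to, here is a minimal sketch, assuming PyTorch 2.5+ with `torch.nn.attention.flex_attention`, of running a compiled flex attention call at a fixed sequence length. The shapes, the causal `score_mod`, and the CUDA device are illustrative; the specific problematic sequence lengths are not stated in the snippet.

```python
# Minimal sketch (not llm-foundry code): exercise torch.compile'd flex attention
# at one sequence length, the kind of case the TODOs above ask to investigate.
import torch
from torch.nn.attention.flex_attention import flex_attention

compiled_flex = torch.compile(flex_attention)

def causal(score, b, h, q_idx, kv_idx):
    # Standard causal masking: keep the score only when query index >= key index.
    return torch.where(q_idx >= kv_idx, score, -float('inf'))

# Illustrative shapes; the actual failing sequence lengths are unspecified.
B, H, S, D = 2, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D, device='cuda', dtype=torch.bfloat16) for _ in range(3))
out = compiled_flex(q, k, v, score_mod=causal)
```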

## 🚀 Feature Request TransformerEngine has advanced Attention kernels, including support for FlashAttention-3 and low-precision kernels. ## Motivation Having TransformerEngine's Attention as an `attn_impl` option would be super nice due...

enhancement
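For context on the request above: a hedged sketch of where such an option would plug in. llm-foundry's MPT models select their attention backend via `attn_config.attn_impl` (values such as `torch` and `flash`); the `te` value below is purely hypothetical and stands in for the proposed TransformerEngine backend.

```python
# Hypothetical sketch only: 'te' is NOT an existing attn_impl value in
# llm-foundry; it illustrates the TransformerEngine backend this request asks for.
model_cfg = {
    'name': 'mpt_causal_lm',
    'd_model': 2048,
    'n_heads': 16,
    'n_layers': 24,
    'attn_config': {
        'attn_impl': 'te',  # hypothetical TransformerEngine-backed kernel
        'attn_pdrop': 0.0,
    },
}
```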

Datasets throws this error: https://github.com/huggingface/datasets/blob/661d7bac29689e2d9eb74fba3d243939d6e9f25b/src/datasets/splits.py#L362 when a split doesn't match the regex. We catch this and re-raise it to the user.
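A minimal sketch of that catch-and-rethrow, assuming the error is the `ValueError` that Hugging Face `datasets` raises when a split name fails its regex check; the wrapper name and message are illustrative, not llm-foundry's actual code.

```python
# Illustrative wrapper (not llm-foundry's actual code): surface a clearer
# message when Hugging Face datasets rejects a split name.
from datasets import load_dataset


def load_hf_split(path: str, split: str, **kwargs):
    try:
        return load_dataset(path, split=split, **kwargs)
    except ValueError as e:
        # datasets raises ValueError when `split` does not match its split-name regex.
        raise ValueError(
            f'Split {split!r} was rejected by Hugging Face datasets. '
            'Check that the split exists and its name uses only allowed characters.',
        ) from e
```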

## Manual Test `test-log-model-no-save-hNNfeX` https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/482477677751793/runs/ea44b38569974edf8573b3f66558a15f?o=7395834863327820 Two models are logged, one at each logging batch. The model from the final batch (ba10) is registered, while ba5 is only logged.

With streaming upgraded to 0.9.1, the unit test runs into an infinite loop.

When I set `moe_loss_weight: 0`:

```
[rank7]: File "/home/syx/miniconda3/envs/lmf/lib/python3.11/site-packages/composer/trainer/trainer.py", line 2907, in
[rank7]:     **kwargs: self._train_microbatches(microbatches, loss_dict, **kwargs).item(),
[rank7]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/home/syx/miniconda3/envs/lmf/lib/python3.11/site-packages/composer/trainer/trainer.py", line 3075, in _train_microbatches
[rank7]:     microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size,...
```

bug