GeorgiosSmyrnis
This adds an option that predownloads data to local storage at the start of each checkpoint, which helps mitigate transient S3 errors.
This adds a unit test for mixing sources, both with and without sampling. This also fixes the naming scheme within tars, which could cause issues if two sequences within the...
This PR adds the capability to skip batches when needed, which is useful when resuming from a checkpoint and some batches should be skipped.
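The core of batch skipping on resume can be sketched as below; this is an illustrative helper, not the PR's code. Batches consumed before the checkpoint are drawn from the loader and discarded so the data stream lines up with where training left off.

```python
def skip_batches(loader, n_skip):
    """Yield batches from loader, discarding the first n_skip.

    When resuming mid-epoch, the first n_skip batches were already
    seen before the checkpoint, so they are drawn and thrown away to
    keep the data order consistent with an uninterrupted run.
    """
    it = iter(loader)
    for _ in range(n_skip):
        next(it, None)  # draw and discard; stops silently if exhausted
    yield from it
```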
This PR changes the GeGLU MLP and adds support for MQA.
This enables mixing of pretokenized data with the tokenize_shuffle.py script. This is enabled by the `--pretok_tars` flag, which assumes that the tarfiles provided to the script contain already tokenized data.
This adds a flag that prevents attention from crossing document boundaries, which are identified by the EOT token. The loss for the token immediately after the EOT token is ignored. TODO: add...
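A NumPy sketch of the masking described above, under the assumption that a token after an EOT starts a new document (the `eot_id` value and function names are hypothetical): the attention mask is the usual causal mask restricted to positions in the same document, and the loss mask zeroes out the position right after each EOT.

```python
import numpy as np

def document_causal_mask(tokens, eot_id):
    """Causal mask that additionally blocks attention across documents.

    Each position may only attend to earlier positions whose document
    index matches its own; document indices increment after every EOT.
    """
    tokens = np.asarray(tokens)
    # Document id for each position: bumps right after an EOT token.
    doc_ids = np.concatenate([[0], np.cumsum(tokens[:-1] == eot_id)])
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((len(tokens), len(tokens)), dtype=bool))
    return same_doc & causal

def loss_mask(tokens, eot_id):
    """True where the loss is kept; the token right after an EOT is ignored."""
    tokens = np.asarray(tokens)
    keep = np.ones(len(tokens), dtype=bool)
    keep[1:][tokens[:-1] == eot_id] = False
    return keep
```

For a sequence like `[5, 6, EOT, 7, 8]`, positions 3 and 4 form a second document and cannot attend back to positions 0-2.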
This PR does the following: - Consolidates the instructions on how to run tokenization in a single README. - Adds a sample script on how to run tokenization on a...
Sometimes the model needs to run a few more training steps in a new epoch, and it would previously load an entire checkpoint's worth of data to do so. This PR limits...
`group_by_keys_nothrow` breaks with `webdataset>=0.2.90`.