fms-fsdp issues

Typo `os.path.is_file` instead of `os.path.isfile`

There seems to be a typo in the `Checkpointer` class. The `_cleanup` method calls `os.path.is_file` instead of `os.path.isfile`.

johannesschmude
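As a quick illustration of the typo reported above: `os.path` provides `isfile`, while `is_file` is a method on `pathlib.Path` objects, so calling `os.path.is_file` raises `AttributeError`. The sketch below only demonstrates the standard-library behavior; the helper `is_regular_file` and the file name `ckpt.pth` are hypothetical and not taken from the `Checkpointer` class.

```python
import os
import tempfile

# os.path exposes isfile(); is_file() belongs to pathlib.Path objects instead.
has_isfile = hasattr(os.path, "isfile")
has_is_file = hasattr(os.path, "is_file")


def is_regular_file(path: str) -> bool:
    """Hypothetical helper: True if `path` exists and is a regular file."""
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "ckpt.pth")  # hypothetical file name
    with open(file_path, "w") as handle:
        handle.write("x")
    file_result = is_regular_file(file_path)  # a real file
    dir_result = is_regular_file(tmp)         # a directory, not a file
```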

Add regression tests for get_latest and get_oldest functions

1

Edit: Bug fix was applied in #119. This PR thus only adds two unit tests to prevent bug regression. --- Code was calling `os.path.join()` too many times and causing the...

weiji14
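A regression test of the kind described above might look roughly like the following. This is a sketch under stated assumptions: `latest_entry` and `oldest_entry` are hypothetical stand-ins, not the repository's `get_latest` and `get_oldest`, and ordering by modification time is assumed. The point is only that the returned path is joined with the directory exactly once.

```python
import os
import tempfile
from typing import Optional


def latest_entry(directory: str) -> Optional[str]:
    """Hypothetical stand-in: full path of the most recently modified entry."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    return max(paths, key=os.path.getmtime) if paths else None


def oldest_entry(directory: str) -> Optional[str]:
    """Hypothetical stand-in: full path of the least recently modified entry."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    return min(paths, key=os.path.getmtime) if paths else None


with tempfile.TemporaryDirectory() as tmp:
    # Create two entries with fixed modification times so the order is deterministic.
    for name, mtime in (("step_100", 100), ("step_200", 200)):
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))

    latest_name = os.path.basename(latest_entry(tmp))
    oldest_name = os.path.basename(oldest_entry(tmp))
    # The parent of the result should be the directory itself (joined only once).
    joined_once = os.path.dirname(latest_entry(tmp)) == tmp
```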

tokenization on-the-fly for long documents

2

As we may have to deal with very long documents up to millions of characters/tokens, the `dataloader` may need to be tested and revised as needed when [it](https://github.com/foundation-model-stack/fms-fsdp/blob/2767c796422eade29a72d36fdf5d4d3a8af0672b/fms_fsdp/utils/dataloader_utils.py#L134) aims at...

dangxuanhong

Minimal implementation of muP scaling for Llama

Implement [muP scaling](https://arxiv.org/abs/2203.03466) for Llama models. Model follows muP scaling laws but introduces the minimal set of extra tunable hyperparameters that allows us to recover prior behavior - thus may...

daviswer

fix: Correct the typo

Signed-off-by: Akash Nayak

Akash-Nayak

FMS-FSDP running on A100 8GPU machine failed with NCCL error messages

A100 8GPU machine with NVLink connections; docker image: nvcr.io/nvidia/pytorch:23.12-py3;

git clone https://github.com/foundation-model-stack/fms-fsdp.git
git clone https://github.com/foundation-model-stack/foundation-model-stack.git
git clone https://github.com/huggingface/optimum-nvidia.git
cd foundation-model-stack
pip install -e .
cd ../fms-fsdp/
pip install -r requirements.txt...

HenryTangMain

The default model variant is 7b but it is not supported.

2

The default model variant is "7b": https://github.com/foundation-model-stack/fms-fsdp/blob/65b0ea670fa375bb0f7f6a285e7229bb96ebdd0f/fms_fsdp/config/training.py#L8 but it is not in the whitelist of supported variants: https://github.com/foundation-model-stack/fms-fsdp/blob/65b0ea670fa375bb0f7f6a285e7229bb96ebdd0f/fms_fsdp/utils/config_utils.py#L25

HenryTangMain
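One way to guard against a default that falls outside the supported list is to validate the variant when the config is built. The sketch below is illustrative only: `SUPPORTED_VARIANTS`, `TrainConfig`, `validate_variant`, and the variant names are hypothetical and do not reproduce the contents of `training.py` or `config_utils.py`.

```python
from dataclasses import dataclass

# Hypothetical whitelist; the real supported names are defined in config_utils.py.
SUPPORTED_VARIANTS = ("llama2_7b", "llama2_13b", "llama2_70b")


@dataclass
class TrainConfig:
    # Hypothetical default, picked from the whitelist so it validates as-is.
    model_variant: str = "llama2_7b"


def validate_variant(variant: str) -> str:
    """Return `variant` unchanged, or raise if it is not in the whitelist."""
    if variant not in SUPPORTED_VARIANTS:
        supported = ", ".join(SUPPORTED_VARIANTS)
        raise ValueError(
            f"model variant {variant!r} is not supported; choose one of: {supported}"
        )
    return variant


# The default should pass validation; a bare "7b" should be rejected.
default_ok = validate_variant(TrainConfig().model_variant)
try:
    validate_variant("7b")
    rejected = False
except ValueError:
    rejected = True
```

A unit test asserting that the default passes validation would catch this kind of mismatch before release.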

Repeatability of Small Model Training Script with fixed seed(s) and same dataset

1

We observed noticeable variability when re-running the FSDP model training script for a small 1.xB llama2 model with fixed seed(s) and the same tokens. Below is a snapshot of the evaluation...

pad9153

Support nested folders for datasets

1

The current code only looks for files directly inside the dataset folder. When the dataset has additional nested folders, the arrow files inside them are not seen.

thinkahead
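A recursive search along the lines requested above could be sketched with `os.walk`. The function name `find_arrow_files`, the `.arrow` suffix filter, and the example file layout are assumptions for illustration, not the repository's dataloader code.

```python
import os
import tempfile
from typing import List


def find_arrow_files(root: str) -> List[str]:
    """Return every .arrow file under `root`, including nested folders, sorted."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".arrow"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one top-level shard, one nested shard, one unrelated file.
    nested_dir = os.path.join(tmp, "nested", "sub")
    os.makedirs(nested_dir)
    for path in (
        os.path.join(tmp, "top.arrow"),
        os.path.join(nested_dir, "deep.arrow"),
        os.path.join(tmp, "notes.txt"),
    ):
        open(path, "w").close()

    found = [os.path.relpath(p, tmp) for p in find_arrow_files(tmp)]
```

Sorting the result keeps the file order stable across runs, which matters if shards are assigned to workers by index.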

Suppress spammy warnings

The current code prints multiple warnings from each GPU at the start of training, which clutters the log. This change updates the dataloader and process group constructors to eliminate these warnings, respectively: ```...

daviswer
