Add data streaming support through `mosaic-streaming`

Open fmv1992 opened this issue 1 year ago • 9 comments

Description

This PR adds support for memory-efficient training through `StreamingDataset`, which keeps the dataset in non-volatile storage and streams it in, instead of loading it fully into RAM.

Motivation and Context

Context: https://github.com/OpenAccess-AI-Collective/axolotl/issues/585 .

How has this been tested?

I have tested this through Docker on a VM.

I'm open to ideas on how this should be added. Does the repo support an S3 bucket, for instance?

fmv1992 avatar Apr 16 '24 12:04 fmv1992

Thanks, much appreciated; I'm just checking a few more things before merging.

The experience of contributing to this repo has been very positive.

fmv1992 avatar Apr 17 '24 14:04 fmv1992

Hey, thanks for the PR. I just wanted to clarify something I asked previously: this would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud-loading section, e.g. "add `stream: true` to load a Mosaic streaming dataset."

You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd.

https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start
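For readers new to Mosaic's format, here is a minimal sketch of the round trip that quick-start describes, assuming the `streaming` package; the paths and column schema are illustrative placeholders:

```python
# Minimal sketch of preparing and loading data in Mosaic's MDS format.
# Output paths and the column schema are illustrative placeholders.
from streaming import MDSWriter, StreamingDataset

samples = [{"text": "hello world"}, {"text": "goodbye world"}]

# Convert raw samples into MDS shards; `out` could also be a remote
# URI such as s3://bucket/path.
with MDSWriter(out="/tmp/mds-data", columns={"text": "str"}) as writer:
    for sample in samples:
        writer.write(sample)

# Stream the shards back lazily instead of loading everything into RAM.
dataset = StreamingDataset(local="/tmp/mds-data")
for sample in dataset:
    print(sample["text"])
```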

NanoCode012 avatar Apr 18 '24 13:04 NanoCode012

I think it needs additional `StreamingDataset` support for pretraining datasets (completion), in addition to fine-tuning datasets.

Kesta-bos avatar Apr 21 '24 06:04 Kesta-bos

Can we pretrain with Axolotl, streaming a data mix from S3?

ehartford avatar Apr 21 '24 23:04 ehartford

> Hey, thanks for the PR. I just wanted to clarify something I asked previously: this would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud-loading section, e.g. "add `stream: true` to load a Mosaic streaming dataset."
>
> You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd.
>
> https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start

JSONL should be fine for streaming; see https://github.com/mosaicml/streaming?tab=readme-ov-file#1-prepare-your-data
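A rough sketch of that conversion, assuming the `JSONWriter` exported by the `streaming` package (file paths and the column schema are illustrative):

```python
# Sketch: turn a plain JSONL file into sharded, indexed JSONL that
# StreamingDataset can consume. Paths are placeholders.
import json

from streaming import JSONWriter

with JSONWriter(out="/tmp/jsonl-shards", columns={"text": "str"}) as writer:
    with open("train.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            writer.write({"text": record["text"]})
```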

winglian avatar Apr 22 '24 01:04 winglian

> Can we pretrain with Axolotl, streaming a data mix from S3?

We can, but I'd prefer to include this in a second PR. Right now I would rather see this smaller change working and merged; expanding on it should be easier later.
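For reference, the S3 part itself needs little new code; a minimal sketch using what `StreamingDataset` already exposes (the bucket URI is a placeholder, and credentials come from the usual AWS environment variables):

```python
# Sketch: stream shards from S3 on the fly, caching them locally.
# The remote URI is a placeholder; AWS credentials are picked up from
# the environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # placeholder bucket/path
    local="/tmp/streaming-cache",        # local shard cache
    shuffle=True,
)
```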

fmv1992 avatar Apr 22 '24 11:04 fmv1992

> Hey, thanks for the PR. I just wanted to clarify something I asked previously: this would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud-loading section, e.g. "add `stream: true` to load a Mosaic streaming dataset."
>
> You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd.
>
> https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start

Addressed in ba86339. Let me know if that covers all your points.

fmv1992 avatar Apr 22 '24 11:04 fmv1992

As per this comment, this is not ready for merging. Maybe we want to remove that tag.

I posted a draft of the changes there, but the issue is that tokenization should happen as we download the data, and right now I'm almost certain it does everything in a batch: it downloads everything, then tokenizes everything, then proceeds to the fine-tuning.
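To make the desired behavior concrete, a minimal sketch of tokenizing per sample as shards stream in, rather than in an up-front batch pass; the wrapper class and its parameters are illustrative, not code from this PR:

```python
# Sketch of the desired behavior: tokenize each sample as it arrives
# from the stream, so tokenization overlaps with shard downloads
# instead of running as a separate batch pass over the whole dataset.
# This wrapper is illustrative and not part of the PR.
from streaming import StreamingDataset
from torch.utils.data import IterableDataset


class TokenizingStream(IterableDataset):
    """Wrap a StreamingDataset and tokenize samples lazily."""

    def __init__(self, remote, local, tokenizer, max_length=2048):
        self.dataset = StreamingDataset(remote=remote, local=local)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __iter__(self):
        for sample in self.dataset:
            # Tokenization happens here, one sample at a time.
            yield self.tokenizer(
                sample["text"],
                truncation=True,
                max_length=self.max_length,
            )


# Usage (names are placeholders):
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# stream = TokenizingStream("s3://bucket/data", "/tmp/cache", tokenizer)
```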

fmv1992 avatar Apr 22 '24 11:04 fmv1992

> but the issue is that tokenization should happen as we download the data, and right now I'm almost certain it does everything in a batch: it downloads everything, then tokenizes everything, then proceeds to the fine-tuning.

@fmv1992, this is correct. I only got around to reviewing your code in detail earlier; the section I pointed you to was incorrect. https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L121

This function loads the whole dataset, merges it, and performs tokenization at this point here:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L410-L411

The only part that "skips" tokenization before fine-tuning is the pretraining section that you attempted to modify before.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L70-L102

I have two ideas as of now:

  1. Discuss a better way to handle data preprocessing between the current `pretraining_dataset` and dataset formats, as the code is currently messy, before continuing further.
  2. Hack around it and support streaming for pretraining datasets first, and figure out SFT later. This is also because your code expects the data in completion format, i.e. `{"text": ...}` (see the sketch after this list); this is not the case for SFT datasets. https://github.com/OpenAccess-AI-Collective/axolotl/blob/ba863392250539eaa1347217672f0e92881583e1/src/axolotl/utils/data/sft.py#L79-L80
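To illustrate the mismatch in point 2, roughly (the SFT field names are illustrative; the real schema depends on the dataset's configured prompt format):

```python
# Completion/pretraining sample: one raw-text field, which is what the
# PR's streaming path currently assumes.
completion_sample = {"text": "The quick brown fox jumps over the lazy dog."}

# SFT sample: structured fields that still need prompt templating
# before tokenization. Field names are illustrative.
sft_sample = {
    "instruction": "Translate to French.",
    "input": "Hello, world!",
    "output": "Bonjour, le monde !",
}
```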

I would also appreciate @winglian 's comments on this.


Side note: what should this `batch_size` be set to? Is it hardcoded to 4 on purpose?

https://github.com/OpenAccess-AI-Collective/axolotl/blob/ba863392250539eaa1347217672f0e92881583e1/src/axolotl/utils/data/sft.py#L76

NanoCode012 avatar Apr 22 '24 12:04 NanoCode012

I'm closing this due to inactivity.

fmv1992 avatar Aug 27 '24 16:08 fmv1992

I'm interested in this.

tmm1 avatar Aug 27 '24 17:08 tmm1

@NanoCode012 @djsaunde @SalmanMohammadi @mhenrichsen Let's think about how we can resurrect this feature.

winglian avatar May 06 '25 13:05 winglian