axolotl
Add data streaming support through `mosaic-streaming`
Description
This PR adds support for memory-efficient (and non-volatile-storage-efficient) training through `StreamingDataset`.
Motivation and Context
Context: https://github.com/OpenAccess-AI-Collective/axolotl/issues/585.
How has this been tested?
I have tested this through Docker on a VM.
I'm open to ideas as to how this should be added. Does the repo support an S3 bucket, for instance?
Thanks, much appreciated; I'm just checking a few more things before merging.
The experience of contributing to this repo has been very positive.
Hey, thanks for the PR. I just wanted to clarify something I asked previously. This would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud-loading section. For example: add `stream: true` to load a Mosaic streaming dataset.
You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd
https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start
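If the suggestion above were adopted, the documented usage might look roughly like the sketch below. Note this is only an illustration of the proposal: the `stream` key is the reviewer's suggested flag (not a merged option), and the bucket path is a placeholder.

```yaml
# Hypothetical axolotl dataset entry for a Mosaic streaming dataset.
datasets:
  - path: s3://my-bucket/my-mds-dataset   # dataset pre-converted to Mosaic's MDS format
    type: completion
    stream: true                          # proposed flag: load via StreamingDataset
```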
I think this needs additional `StreamingDataset` support for pretraining datasets (completion format) in addition to fine-tuning datasets.
Can we pretrain with Axolotl by streaming a data mix from S3?
JSONL should be fine for streaming. see https://github.com/mosaicml/streaming?tab=readme-ov-file#1-prepare-your-data
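The essential property of a streamable format like JSONL is that samples can be read one at a time rather than materialized all at once. A minimal standard-library sketch of that access pattern (this illustrates the idea only; it is not Mosaic's actual loader API):

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed sample at a time instead of loading the file into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo: write a tiny completion-format shard and stream it back.
samples = [{"text": f"example {i}"} for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), "shard.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

streamed = list(iter_jsonl(path))
print(streamed[0]["text"])  # -> example 0
```

Because `iter_jsonl` is a generator, peak memory stays at one sample regardless of shard size, which is the same property Mosaic's shard readers rely on.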
Can we pretrain with Axolotl by streaming a data mix from S3?
We can, but I'd prefer to include that in a second PR. Right now I would rather see this smaller change working and merged; expanding on it should be easier later.
Addressed in ba86339. Let me know if that covers all your points.
As per this comment, this is not ready for merging; maybe we want to remove that tag.
I posted a draft of the changes there, but the issue is that tokenization should happen as we download the data, and right now I'm almost certain it does everything in batch: it downloads everything, then tokenizes everything, then proceeds to the fine-tuning.
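The eager-vs-streaming distinction described here can be sketched with plain generators. `fake_download` and `fake_tokenize` are stand-ins for the real network and tokenizer calls, not axolotl functions:

```python
def fake_download(num_shards):
    """Stand-in for fetching shards over the network, one at a time."""
    for i in range(num_shards):
        yield f"raw text {i}"

def fake_tokenize(text):
    """Stand-in for a real tokenizer."""
    return text.split()

def eager_pipeline(num_shards):
    # Current behaviour: everything is materialized before training starts.
    raw = list(fake_download(num_shards))    # download all shards first
    return [fake_tokenize(t) for t in raw]   # then tokenize all of them

def streaming_pipeline(num_shards):
    # Desired behaviour: tokenize each shard as it arrives.
    for text in fake_download(num_shards):
        yield fake_tokenize(text)

eager = eager_pipeline(2)
lazy = list(streaming_pipeline(2))
assert eager == lazy  # identical results; only peak memory differs
```

Both pipelines produce the same tokens, but the eager one holds every raw shard in memory at once, which is exactly what `StreamingDataset` is meant to avoid.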
but the issue is that tokenization should happen as we download the data, and right now I'm almost certain it does everything in batch: it downloads everything, then tokenizes everything, then proceeds to the fine-tuning.
@fmv1992, this is correct. I only got to review your code in detail earlier. The section I pointed you to was incorrect. https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L121
This function runs over the whole dataset, merges it, and performs tokenization at this point:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L410-L411
The only part that "skips" tokenization before fine-tuning is the pretraining section that you attempted to modify before.
https://github.com/OpenAccess-AI-Collective/axolotl/blob/60f5ce0569b7f1d522ef81ea986ebfdc98780e6a/src/axolotl/utils/data/sft.py#L70-L102
I have two ideas as of now:
- Discuss a better way to handle data preprocessing between the current `pretraining_dataset` and `dataset` formats, as the code is currently messy, before continuing further.
- Hack around and support streaming for pretraining datasets first, and figure out SFT later. This is also because your code expects the data in `completion` (i.e. `{ "text": "..." }`) format, which is not the case for SFT datasets. https://github.com/OpenAccess-AI-Collective/axolotl/blob/ba863392250539eaa1347217672f0e92881583e1/src/axolotl/utils/data/sft.py#L79-L80
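The format mismatch mentioned above can be illustrated with two toy records. The field names here are illustrative examples of common layouts, not axolotl's exact schema:

```python
# A pretraining / completion sample: a single free-text field.
completion_sample = {"text": "The quick brown fox."}

# A typical SFT sample: structured instruction/output pairs.
sft_sample = {
    "instruction": "Translate to French.",
    "output": "Le renard brun rapide.",
}

def is_completion_format(sample):
    """A streaming path that only reads a 'text' key handles the first
    layout but silently fails to cover the second."""
    return "text" in sample

assert is_completion_format(completion_sample)
assert not is_completion_format(sft_sample)
```

This is why streaming support that assumes `{"text": ...}` covers pretraining datasets but would need extra handling for SFT datasets.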
I would also appreciate @winglian 's comments on this.
Side note: what should this `batch_size` be set to? Is it hardcoded to 4 on purpose?
https://github.com/OpenAccess-AI-Collective/axolotl/blob/ba863392250539eaa1347217672f0e92881583e1/src/axolotl/utils/data/sft.py#L76
I'm closing this due to inactivity.
I'm interested in this
@NanoCode012 @djsaunde @SalmanMohammadi @mhenrichsen Let's think about how we can resurrect this feature.