unify data loading from HF and from disk
Stack from ghstack (oldest at bottom):
- -> #287
As titled. We can just use the `load_dataset` HF API to unify different use cases.
- `load_dataset` is flexible in that it can take a HF hub dataset repository or a local directory, and the behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc. (see the sketch after this list).
- According to this post, `load_dataset` works in three steps: download the dataset, prepare it as an arrow dataset, and finally return a memory-mapped arrow dataset. In particular, it creates a cache directory to store the arrow data and the subsequent cache files for `map`.
- Previously used `load_from_disk` can only load datasets saved by `save_to_disk` (in arrow format), which can be viewed as a way to load "preprocessed" datasets: `load_from_disk` directly returns a memory-mapped dataset from the arrow file (similar to `Dataset.from_file`). It doesn't create a cache directory; instead, all the subsequent `map` calls write into the same directory as the original data.
- For large datasets (which cannot fit in memory), we need to set `streaming=True` for `load_dataset`, even if the data is stored in a local directory. One might think `load_from_disk` is better because of point 3 above; however, to preprocess a huge dataset and call `save_to_disk`, one needs to load it in memory in the first place.
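
A minimal sketch of the unified loading path described above, assuming the `datasets` library; the dataset name, config, and local file path are illustrative placeholders, not the ones used in this repo:

```python
from datasets import load_dataset

# Case 1: load from a HF hub dataset repository.
hub_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Case 2: load the same kind of data from a local file in a common format
# (json here); the resulting dataset behaves consistently with Case 1.
local_ds = load_dataset(
    "json",
    data_files="path/to/c4_sample.json",  # hypothetical path
    split="train",
    streaming=True,
)

# With streaming=True, both calls return an IterableDataset, so no full
# download/prepare step (and no arrow cache materialization) happens up front.
print(next(iter(hub_ds)))
```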
For all the reasons listed above, let's not use `load_from_disk`, which assumes preprocessed data in arrow format. Let's use `load_dataset`, which supports common data formats, and set `streaming=True` for large datasets, no matter whether the data comes from the HF hub or from local disk.
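
To make the streaming recommendation concrete, here is a hedged sketch of how a streaming dataset is consumed lazily; the byte-level `encode` stand-in and the `"text"` field name are assumptions for illustration, not this repo's actual pipeline:

```python
from datasets import load_dataset

def encode(text: str) -> list[int]:
    # Stand-in tokenizer (raw UTF-8 bytes); a real run would use the
    # model's tokenizer instead.
    return list(text.encode("utf-8"))

ds = load_dataset(
    "json",
    data_files="path/to/c4_sample.json",  # hypothetical path
    split="train",
    streaming=True,
)

# On an IterableDataset, map() is applied lazily as samples are pulled.
ds = ds.map(lambda ex: {"ids": encode(ex["text"])})

# shuffle() on a streaming dataset draws from a fixed-size buffer rather
# than computing a global permutation, so memory stays bounded.
ds = ds.shuffle(buffer_size=10_000, seed=0)

for sample in ds:
    _ = sample["ids"]  # feed into the training loop
    break
```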
P.S.:
- This PR updates the data file from `arrow` to `json`, while keeping the same data (first 45,000 entries of `c4`); a sketch of this kind of conversion is included below.
- `c4` is now available to run large scale experiments. Performance verified.
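
For reference, a hedged sketch of the kind of arrow-to-json conversion the first P.S. item describes; both file paths are placeholders, not the actual paths changed by this PR:

```python
from datasets import Dataset

# Load the old preprocessed arrow file directly (no cache directory involved).
ds = Dataset.from_file("path/to/data.arrow")  # hypothetical path

# Re-export the same entries as JSON lines, a format load_dataset handles
# natively whether the data sits on the HF hub or on local disk.
ds.to_json("path/to/data.json", lines=True)  # hypothetical path
```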