llm-foundry
Delta table input support for CPT
Enable a Delta table as input for continued pretraining (CPT).
For CPT, you need to provide tokenizer arguments so that the resulting MDS dataset can be written:
python scripts/data_prep/convert_delta_to_json.py \
  --delta_table_name main.streaming.random_cpt_table \
  --processes 128 \
  --cluster_id 1214-001856-19o83v16 \
  --task_type CONTINUED_PRETRAIN \
  --mds_output_path /tmp/test_mds11 \
  --json_output_path /tmp/test_json11 \
  --tokenizer mosaicml/mpt-7b
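For anyone launching this from a job script, here is a minimal sketch of assembling the same command in Python and failing early when the tokenizer is missing for a CPT run. The helper `build_convert_command` and the `CPT_REQUIRED` tuple are hypothetical and not part of llm-foundry; the flag names are copied from the command above, and treating `--tokenizer` as the only required CPT argument is an assumption based on this description.

```python
import shlex

# Hypothetical helper, not part of llm-foundry. It only assembles the
# command line shown above from a dict of flag values.
# Assumption: for CONTINUED_PRETRAIN the tokenizer must be supplied.
CPT_REQUIRED = ("tokenizer",)


def build_convert_command(args: dict) -> list[str]:
    """Return the argv list for convert_delta_to_json.py."""
    if args.get("task_type") == "CONTINUED_PRETRAIN":
        missing = [name for name in CPT_REQUIRED if not args.get(name)]
        if missing:
            raise ValueError(f"CPT conversion is missing: {', '.join(missing)}")

    cmd = ["python", "scripts/data_prep/convert_delta_to_json.py"]
    for flag, value in args.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


# Values taken from the example command in this description.
example_args = {
    "delta_table_name": "main.streaming.random_cpt_table",
    "processes": 128,
    "cluster_id": "1214-001856-19o83v16",
    "task_type": "CONTINUED_PRETRAIN",
    "mds_output_path": "/tmp/test_mds11",
    "json_output_path": "/tmp/test_json11",
    "tokenizer": "mosaicml/mpt-7b",
}

command = build_convert_command(example_args)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root. Keeping the check outside the script itself means a missing tokenizer is reported before any cluster work starts.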