
Delta input cpt support

Open XiaohanZhangCMU opened this issue 1 year ago • 0 comments

Enable delta table as input for CPT

For CPT (continued pretraining), you need to provide tokenizer arguments so that the resulting MDS dataset can be written:

python scripts/data_prep/convert_delta_to_json.py --delta_table_name main.streaming.random_cpt_table --processes 128 --cluster_id 1214-001856-19o83v16 --task_type CONTINUED_PRETRAIN --mds_output_path /tmp/test_mds11 --json_output_path /tmp/test_json11 --tokenizer mosaicml/mpt-7b
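As a rough sketch, the same invocation could be assembled programmatically, which makes it easier to see which arguments are specific to CPT. The helper name `build_cpt_command` is hypothetical and not part of llm-foundry; only the script path, flags, and values are taken from the command above.

```python
import shlex


def build_cpt_command(delta_table_name, cluster_id, mds_output_path,
                      json_output_path, tokenizer, processes=128):
    """Hypothetical helper: build the argument list for the command above."""
    return [
        "python", "scripts/data_prep/convert_delta_to_json.py",
        "--delta_table_name", delta_table_name,
        "--processes", str(processes),
        "--cluster_id", cluster_id,
        # The task type shown in the command above for continued pretraining.
        "--task_type", "CONTINUED_PRETRAIN",
        # The tokenizer and MDS output path are the extra arguments used for CPT.
        "--mds_output_path", mds_output_path,
        "--json_output_path", json_output_path,
        "--tokenizer", tokenizer,
    ]


cmd = build_cpt_command(
    delta_table_name="main.streaming.random_cpt_table",
    cluster_id="1214-001856-19o83v16",
    mds_output_path="/tmp/test_mds11",
    json_output_path="/tmp/test_json11",
    tokenizer="mosaicml/mpt-7b",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming the environment the script needs is already configured.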

XiaohanZhangCMU, Jan 09 '24 21:01