llm-foundry
Validation
JIRA: https://databricks.atlassian.net/jira/software/c/projects/STR/issues/STR-141?filter=allissues
This script catches malformed input data before it is sent to the FT API. It acts as a preventive check on data integrity and helps estimate the cost of the fine-tuning process. The FT API does not call this script; users need to run it as a standalone script before they make a call to the FT API.
Tasks include:
- count_tokens
- Run tokenization on the dataset.
- For the IFT task: validate tokenization by running the tokenizer and filter on the entire dataset, then count the number of tokens. An error is raised if any prompt or response is empty.
- For the CPT task: call download_text_to_mds.py and count the tokens in the resulting MDS dataset. Note that this can take a long time.
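The IFT check described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the `prompt` and `response` field names, the `count_ift_tokens` helper, and the whitespace tokenizer are all assumptions chosen to keep the example self-contained. In practice the model's real tokenizer would be passed in instead.

```python
from typing import Callable, Dict, Iterable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def count_ift_tokens(
    samples: Iterable[Dict[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> int:
    """Tokenize every sample and return the total token count.

    Raises ValueError on the first sample whose prompt or response
    is missing, not a string, or empty after stripping whitespace.
    The field names "prompt" and "response" are assumed here.
    """
    total = 0
    for i, sample in enumerate(samples):
        for key in ("prompt", "response"):
            value = sample.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"sample {i}: empty or missing {key!r}")
            total += len(tokenize(value))
    return total


if __name__ == "__main__":
    data = [
        {"prompt": "What is MDS?", "response": "A streaming dataset format."},
        {"prompt": "Count these tokens", "response": "ok"},
    ]
    print(count_ift_tokens(data))
```

Failing on the first bad sample keeps the check cheap to run before any fine-tuning call is made; collecting all offending indices instead would be an equally reasonable design if a full report is preferred.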