llm-foundry

Validation

Open XiaohanZhangCMU opened this issue 1 year ago • 0 comments

JIRA: https://databricks.atlassian.net/jira/software/c/projects/STR/issues/STR-141?filter=allissues

This script is useful when the data passed to the FT API may be malformed. It acts as a preventive check on data integrity and helps estimate the cost of the fine-tuning run. The FT API does not call the script; users need to run it as a standalone script before they call the FT API.

Tasks include:

  • count_tokens
  • run tokenization on the dataset
    1. For the IFT task: validate tokenization by running the tokenizer + filter on the entire dataset and count the number of tokens. Throw an error if there are any empty responses or prompts.
    2. For the CPT task: call download_text_to_mds.py and count the tokens in the resulting MDS dataset. Note that this could take a long time.
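
A minimal sketch of the IFT check described above, not the actual script. It assumes each sample is a mapping with `prompt` and `response` string fields, and it uses a whitespace splitter as a stand-in for the real tokenizer. The names `count_ift_tokens` and `whitespace_tokenize` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Mapping


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def count_ift_tokens(
    samples: Iterable[Mapping[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> int:
    """Tokenize every sample and return the total token count.

    Raises ValueError on the first sample whose prompt or response is
    missing, not a string, or empty after stripping whitespace.
    """
    total = 0
    for i, sample in enumerate(samples):
        # "prompt" and "response" are assumed field names.
        for key in ("prompt", "response"):
            text = sample.get(key, "")
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"Sample {i} has an empty or missing {key!r}")
            total += len(tokenize(text))
    return total
```

To count with a real tokenizer instead, one option is to pass a callable that returns its token ids, for example `lambda s: tokenizer(s)["input_ids"]` with a Hugging Face tokenizer. The total can then be used for the cost estimate.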

XiaohanZhangCMU, Jan 08 '24 06:01