LLM-Finetuning-Toolkit
CUDA device-side runtime error when training on custom dataset for JSON outputs
**Describe the bug**
When attempting to train on this dataset: https://huggingface.co/datasets/azizshaw/text_to_json, a CUDA device-side runtime error is raised.
**To Reproduce**
Steps to reproduce the behaviour:
1. Check out the main branch
2. Replace the data ingestion portion of llmtune/config.yml with:
```yaml
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "azizshaw/text_to_json"
  prompt: >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    {instruction}
    Now create a json object for the following scenario
    {input}
  prompt_stub: >- # Stub to add for training at the end of prompt; for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```
3. Run:
```
llmtune run llmtune/config.yml
```
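For context on what the placeholders mean, here is a minimal sketch of how I understand the templating to work (assumption: the toolkit fills `{column}` placeholders from each dataset row with Python-style formatting; the row values below are made up):

```python
# Minimal sketch of prompt/prompt_stub templating (assumed behaviour,
# not the toolkit's actual code). Row values are invented for illustration.
row = {
    "instruction": "You are a JSON generator.",
    "input": "Alice met Bob in Paris.",
    "output": '{"people": ["Alice", "Bob"], "place": "Paris"}',
}

prompt = (
    "{instruction}\n"
    "Now create a json object for the following scenario\n"
    "{input}"
).format(**row)

# prompt_stub is appended for training; omitted for test set or inference.
target = "{output}".format(**row)
print(prompt + "\n" + target)
```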
**Expected behavior**
To my knowledge, this should run without error.
**Environment:**
- OS: Ubuntu 20.04
- Running locally on an RTX 3090
- Using the developer Poetry environment/shell
This bug doesn't occur on the default dataset, only on this one, so it could be caused by a specific token or encoding in the dataset, or by the JSON outputs interfering with the YAML syntax in the config.
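If it helps narrow things down: device-side asserts during training are often an out-of-range index into the embedding table, and running with `CUDA_LAUNCH_BLOCKING=1` usually surfaces the real failing op. A hedged diagnostic, assuming the default train split and a placeholder base model (substitute whichever model the config uses):

```python
# Hypothetical diagnostic, not part of the toolkit: check whether any token id
# produced from the dataset exceeds the model's vocabulary, a common cause of
# CUDA device-side asserts. The base model name is a placeholder.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ds = load_dataset("azizshaw/text_to_json", split="train")

max_id = 0
for row in ds:
    text = f"{row['instruction']}\n{row['input']}\n{row['output']}"
    max_id = max(max_id, max(tok(text)["input_ids"]))

# len(tok) includes added tokens; it must not exceed the model's embedding rows.
print(f"max token id: {max_id}, tokenizer size: {len(tok)}")
```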
FYI, the dataset below also doesn't work; it fails for another reason, a TypeError somewhere else. That could be unrelated, but I'm wondering whether the method for injecting prompts/responses is robust to stringified JSON (see the sketch after the config below).
```yaml
# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "growth-cadet/jobpost_signals-to-json_test_mistral01gen"
  prompt: >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Given the following job posting, convert the text into a JSON object, with relevant fields.
    ## Job posting
    {context}
    ## JSON
  prompt_stub: >- # Stub to add for training at the end of prompt; for test set or inference, this is omitted; make sure only one variable is present
    {mistral01_gen}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
```
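To illustrate the stringified-JSON concern (a hypothetical repro; I don't know the toolkit's actual injection code): plain `str.format` does not re-parse substituted values, but any literal brace in the template itself is treated as a placeholder, so JSON-flavoured prompts can blow up before the data is even touched.

```python
# Hypothetical illustration of where naive {column} templating can break.
row = {"mistral01_gen": '{"title": "ML Engineer", "remote": true}'}

# Substituted values are not re-parsed, so stringified JSON in a column is fine:
print("## JSON\n{mistral01_gen}".format(**row))

# But a literal brace in the template itself is parsed as a placeholder:
try:
    'Return JSON like {"title": ...}\n{mistral01_gen}'.format(**row)
except KeyError as err:
    print("KeyError:", err)  # '"title"' is treated as a field name
# A robust injector would escape literal braces ({{ }}) or avoid str.format.
```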
@SinclairHudson Thanks for flagging this issue. For "azizshaw/text_to_json", can you attach the error message? It ran fine for me.
For "growth-cadet/jobpost_signals-to-json_test_mistral01gen", I've identified issue with table display where int and float types weren't converted to str properly. I've patched this issue under https://github.com/georgian-io/LLM-Finetuning-Toolkit/pull/172
device-side.txt
I've included the whole output, both stdout and stderr, for the "azizshaw/text_to_json" case.
Can you run `transformers-cli env` and paste in the output?
Can you attach the config as well?
With the above info, I will try to replicate and debug on my end.
Also, it could be an issue due to using multiple GPUs (https://github.com/huggingface/transformers/issues/22546). If the model is small enough, can you try pinning the weights to one GPU via device_map?
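Something along these lines (a sketch using the standard transformers device_map argument; substitute whichever base model your config uses):

```python
# Sketch: pin the whole model to a single GPU instead of letting accelerate
# shard it across devices. device_map={"": 0} places every module on cuda:0.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder: use the configured base model
    device_map={"": 0},
)
```

Setting `CUDA_VISIBLE_DEVICES=0` before launching has the same effect at the process level.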