
cuda device-side runtime error when training on custom dataset for JSON outputs

Open SinclairHudson opened this issue 1 year ago • 2 comments

Describe the bug
When attempting to train on this dataset, I hit a CUDA device-side runtime error: https://huggingface.co/datasets/azizshaw/text_to_json

To Reproduce
Steps to reproduce the behaviour:

1. Check out the main branch.
2. Replace the data ingestion portion of llmtune/config.yml with:

data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "azizshaw/text_to_json"
  prompt: >- # prompt; make sure column inputs are enclosed in {} brackets and that they match your data
    {instruction}
    Now create a json object for the following scenario
    {input}
  prompt_stub: >- # Stub to add for training at the end of the prompt; for the test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42

And then run

llmtune run llmtune/config.yml
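As a side note on the split settings: per the comments in the config, test_size and train_size accept either a float proportion or an integer sample count. A minimal standalone sketch of those semantics (illustrative only, not the toolkit's actual implementation):

```python
import random

def split_rows(rows, test_size, seed):
    """Deterministic train/test split mirroring the config semantics:
    a float test_size is a proportion of the total, an int is an
    absolute number of test samples. Illustrative only."""
    n_test = test_size if isinstance(test_size, int) else round(len(rows) * test_size)
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # seed makes the split reproducible
    test_idx = set(idx[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

rows = [{"id": i} for i in range(10)]
train, test = split_rows(rows, test_size=0.1, seed=42)
print(len(train), len(test))  # 9 1
```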

Expected behavior
To my knowledge, this should run without error.

Environment:

  • OS: Ubuntu 20.04
  • running locally on a 3090
  • using the developer poetry environment/shell

This bug doesn't occur on the default dataset, only on this one. So it could be caused by a specific token or encoding in this dataset, or by the JSON outputs interfering with the YAML syntax in the config.

SinclairHudson avatar May 12 '24 22:05 SinclairHudson

FYI, this dataset also doesn't work, though it fails for a different reason: a TypeError elsewhere. That may be unrelated, but I'm wondering whether the method for injecting prompts/responses is robust to stringified JSON.

# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "growth-cadet/jobpost_signals-to-json_test_mistral01gen"
  prompt: >- # prompt; make sure column inputs are enclosed in {} brackets and that they match your data
    Given the following job posting, convert the text into a JSON object, with relevant fields.
    ## Job posting
    {context}
    ## JSON
  prompt_stub: >- # Stub to add for training at the end of the prompt; for the test set or inference, this is omitted; make sure only one variable is present
    {mistral01_gen}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
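To make the stringified-JSON concern concrete: if prompt injection is done with Python's str.format (an assumption on my part about the toolkit's internals), braces in the substituted data are harmless, but literal JSON braces in the template itself get parsed as placeholders. A minimal sketch:

```python
# Braces in substituted *data* are safe with str.format: format() only
# parses the template, never the values it substitutes in.
template = "## Job posting\n{context}\n## JSON\n"
row = {"context": '{"company": "Acme"}'}  # stringified JSON as data
print(template.format(**row))  # substitutes cleanly

# But literal JSON braces in the *template itself* (e.g. a few-shot
# example pasted into the prompt) are parsed as placeholders:
bad_template = '{"title": "engineer"}\n{context}'
try:
    bad_template.format(**row)
except KeyError as e:
    print("KeyError:", e)  # format() tried to look up the key '"title"'
```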

SinclairHudson avatar May 12 '24 23:05 SinclairHudson

@SinclairHudson Thanks for flagging this. For "azizshaw/text_to_json", can you attach the error message? It ran fine for me.

For "growth-cadet/jobpost_signals-to-json_test_mistral01gen", I've identified an issue with the table display where int and float values weren't converted to str properly. I've patched it in https://github.com/georgian-io/LLM-Finetuning-Toolkit/pull/172
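The shape of that fix can be sketched as follows (illustrative only; the function name is mine, not taken from PR #172):

```python
def stringify_row(row):
    """Coerce every cell to str so a text-table renderer that expects
    string cells doesn't raise TypeError on int or float values."""
    return {key: val if isinstance(val, str) else str(val) for key, val in row.items()}

row = {"title": "engineer", "headcount": 3, "score": 0.87}
print(stringify_row(row))  # {'title': 'engineer', 'headcount': '3', 'score': '0.87'}
```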

benjaminye avatar May 13 '24 15:05 benjaminye

I've included the whole output, both stdout and stderr, for the "azizshaw/text_to_json" case: device-side.txt

SinclairHudson avatar May 19 '24 04:05 SinclairHudson

Can you run transformers-cli env and paste in the output? Can you also attach your config?

With the above info, I'll try to replicate and debug on my end.

It could also be an issue with using multiple GPUs (https://github.com/huggingface/transformers/issues/22546). If the model is small enough, can you try pinning the weights to one GPU via device_map?
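For reference, pinning everything to one GPU means passing a device_map whose empty-string key covers the whole module tree (this is accelerate's convention; the usage lines are a sketch, and the model loading is not executed here):

```python
def single_gpu_device_map(gpu_index=0):
    """Build a device_map that places the entire model on one GPU.
    In accelerate's convention, the empty-string key "" matches the
    whole module tree, so no layer is dispatched to another device."""
    return {"": gpu_index}

print(single_gpu_device_map(0))  # {'': 0}

# Usage sketch (requires transformers and a GPU; model_id is whatever
# your config specifies):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map=single_gpu_device_map(0))
```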

benjaminye avatar May 21 '24 14:05 benjaminye