generative-ai
Model tuning does not work
Like most of the code uploaded by Google developers, your model tuning code that uses the Stack Overflow data fails miserably with the errors below.
{
"summary": "Found 7 errors in your file. See 'errors' field for specific details.\nValidated 4000 examples for tokenization. Found 7 examples where either 'input_text' or 'output_text' exceeds the model token limits. See 'tokenization_issues' field for some specific examples.\nValidated 1000 examples for RAI. Found 43 examples that has RAI issues. See 'rai_issues' field for some specific examples.\n",
"max_user_input_token_length": 8177,
"tokenization_issues": [
"Row: 122. Token limit exceeded for 'input_text' [tokens: 15851|limit: 8192] or 'output_text' [tokens: 24|limit: 1024]",
"Row: 362. Token limit exceeded for 'input_text' [tokens: 13474|limit: 8192] or 'output_text' [tokens: 19|limit: 1024]",
"Row: 391. Token limit exceeded for 'input_text' [tokens: 10643|limit: 8192] or 'output_text' [tokens: 34|limit: 1024]",
"Row: 528. Token limit exceeded for 'input_text' [tokens: 9351|limit: 8192] or 'output_text' [tokens: 17|limit: 1024]",
"Row: 840. Token limit exceeded for 'input_text' [tokens: 16309|limit: 8192] or 'output_text' [tokens: 33|limit: 1024]",
"Row: 868. Token limit exceeded for 'input_text' [tokens: 20337|limit: 8192] or 'output_text' [tokens: 51|limit: 1024]",
"Row: 1535. Token limit exceeded for 'input_text' [tokens: 8969|limit: 8192] or 'output_text' [tokens: 26|limit: 1024]"
],
"rai_issues": [
"Row: 15. RAI violation. High scores for categories Finance",
"Row: 46. RAI violation. High scores for categories Finance",
"Row: 275. RAI violation. High scores for categories Finance",
"Row: 401. RAI violation. High scores for categories Finance",
"Row: 444. RAI violation. High scores for categories Health",
"Row: 503. RAI violation. High scores for categories Finance",
"Row: 558. RAI violation. High scores for categories Finance",
"Row: 571. RAI violation. High scores for categories Health",
"Row: 848. RAI violation. High scores for categories Finance",
"Row: 934. RAI violation. High scores for categories Finance",
"... there are more cases ..."
],
"errors": [
"Row: 122. exceeds token limit",
"Row: 362. exceeds token limit",
"Row: 391. exceeds token limit",
"Row: 528. exceeds token limit",
"Row: 840. exceeds token limit",
"Row: 868. exceeds token limit",
"Row: 1535. exceeds token limit"
],
"max_user_output_token_length": 79
}
Understood. I am new to this repo but an LLM enthusiast. I can try to reproduce and triage this if you share the specific use case and the exact code you ran. Here to help.
I ran into similar rai_issues even with private data. Samples were flagged when they contained a person's name or asked about going to a specific bank's website. The errors went away once I removed those samples from my JSONL file. So, unless these examples are crucial, you could try removing them.
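If it helps, here is a rough sketch of how the flagged rows could be dropped automatically instead of by hand. It assumes the validation report above has been saved as JSON and that each "Row: N." entry refers to the N-th line of the JSONL file. Whether N counts from 0 or 1 is not stated in the report, so that is exposed as the first_row parameter and should be checked against one known bad row first. The function names are made up for this example and are not part of any Google library.

```python
import json
import re

# Matches the leading "Row: 122." part of a report message.
ROW_PATTERN = re.compile(r"^Row: (\d+)\.")


def flagged_rows(report, fields=("errors", "rai_issues")):
    """Collect the row numbers mentioned in the given report fields."""
    rows = set()
    for field in fields:
        for message in report.get(field, []):
            match = ROW_PATTERN.match(message)
            if match:  # skips entries such as "... there are more cases ..."
                rows.add(int(match.group(1)))
    return rows


def drop_rows(lines, rows_to_drop, first_row=0):
    """Return the lines whose row number is not in rows_to_drop.

    first_row is an assumption: set it to 0 or 1 depending on how the
    validator numbers rows.
    """
    return [
        line
        for number, line in enumerate(lines, start=first_row)
        if number not in rows_to_drop
    ]


def filter_jsonl(report_path, input_path, output_path, first_row=0):
    """Write a copy of the JSONL file without the rows named in the report."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    with open(input_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    kept = drop_rows(lines, flagged_rows(report), first_row=first_row)
    with open(output_path, encoding="utf-8") as f_out:
        f_out.write("\n".join(kept) + "\n")
```

Note that the rai_issues list in the report is truncated ("... there are more cases ..."), and the summary says only 1000 examples were validated for RAI, so one pass will probably not remove every flagged sample. You may need to re-run validation on the filtered file and repeat.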