
fix: learn the stop tokens when training.

Open · congchan opened this issue 1 year ago • 11 comments

Why are these changes needed?

Some models need to explicitly learn to generate the stop tokens; otherwise the trained models will not stop when serving. This is model-specific behavior, and not all models need it, but for compatibility I think it is better to make this setting the default.
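To make this concrete, here is a minimal sketch of the label-masking idea, using a hypothetical helper rather than the PR's actual code: every token is masked with -100 except the assistant replies, and each reply's trailing stop token keeps its real id so the loss teaches the model to emit it.

    IGNORE_TOKEN_ID = -100  # labels with this value are excluded from the loss

    def build_labels(input_ids, assistant_spans):
        """assistant_spans: (start, end) index pairs covering each assistant
        reply, where end is exclusive and includes the trailing stop token."""
        labels = [IGNORE_TOKEN_ID] * len(input_ids)
        for start, end in assistant_spans:
            # Learn the reply tokens AND the stop token that terminates them,
            # so the served model knows when to stop generating.
            labels[start:end] = input_ids[start:end]
        return labels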

Currently, different models behave differently:

  • Yi-34b: no need to learn the stop tokens.
  • Qwen1.5-14b: need to learn the stop tokens.
  • Mistral-7B-Instruct-v0.2: need to learn the stop token </s> as we guessed in https://github.com/lm-sys/FastChat/issues/3055

Tested with the models below. In each dump, the columns are: token id, training label, decoded token; a label of -100 means the token is masked out of the loss.
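(The dumps can be reproduced with a small helper along these lines, a sketch only; input_ids and labels come from a preprocessing step like the one sketched above:)

    from transformers import AutoTokenizer

    def dump_labels(input_ids, labels, model_path="mistralai/Mistral-7B-Instruct-v0.2"):
        tok = AutoTokenizer.from_pretrained(model_path)
        for tid, lab in zip(input_ids, labels):
            # token id, training label, decoded token text
            print(tid, "\t", lab, "\t", tok.decode([tid]))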

Mistral-7B-Instruct-v0.2

1 	 -100 	 <s>
733 	 -100 	 [
16289 	 -100 	 INST
28793 	 -100 	 ]
995 	 -100 	 You
460 	 -100 	 are
396 	 -100 	 an
16107 	 -100 	 AI
28723 	 -100 	 .
13 	 -100 	 

3195 	 -100 	 What
349 	 -100 	 is
582 	 -100 	 up
28804 	 -100 	 ?
733 	 -100 	 [
28748 	 -100 	 /
16289 	 -100 	 INST
28793 	 -100 	 ]
22557 	 22557 	 Hello
28808 	 28808 	 !
1602 	 1602 	 How
541 	 541 	 can
315 	 315 	 I
1316 	 1316 	 help
368 	 368 	 you
3154 	 3154 	 today
28804 	 28804 	 ?
2 	 2 	 </s>
733 	 -100 	 [
16289 	 -100 	 INST
28793 	 -100 	 ]
6526 	 -100 	 Who
460 	 -100 	 are
368 	 -100 	 you
28804 	 -100 	 ?
733 	 -100 	 [
28748 	 -100 	 /
16289 	 -100 	 INST
28793 	 -100 	 ]
995 	 995 	 You
541 	 541 	 can
1034 	 1034 	 call
528 	 528 	 me
17862 	 17862 	 Vic
5892 	 5892 	 una
28725 	 28725 	 ,
304 	 304 	 and
315 	 315 	 I
403 	 403 	 was
10898 	 10898 	 trained
486 	 486 	 by
23292 	 23292 	 Large
8871 	 8871 	 Model
17259 	 17259 	 Systems
21919 	 21919 	 Organization
325 	 325 	 (
28758 	 28758 	 L
3477 	 3477 	 MS
28802 	 28802 	 Y
28735 	 28735 	 S
28731 	 28731 	 )
15334 	 15334 	 researchers
390 	 390 	 as
264 	 264 	 a
3842 	 3842 	 language
2229 	 2229 	 model
28723 	 28723 	 .
2 	 2 	 </s>
733 	 -100 	 [
16289 	 -100 	 INST
28793 	 -100 	 ]
5801 	 -100 	 Good
17664 	 -100 	 bye
733 	 -100 	 [
28748 	 -100 	 /
16289 	 -100 	 INST
28793 	 -100 	 ]
5801 	 5801 	 Good
17664 	 17664 	 bye
28808 	 28808 	 !
1047 	 1047 	 If
368 	 368 	 you
506 	 506 	 have
707 	 707 	 any
680 	 680 	 more
4224 	 4224 	 questions
297 	 297 	 in
272 	 272 	 the
3437 	 3437 	 future
28725 	 28725 	 ,
949 	 949 	 don
28742 	 28742 	 '
28707 	 28707 	 t
10816 	 10816 	 hes
9647 	 9647 	 itate
298 	 298 	 to
1460 	 1460 	 ask
28723 	 28723 	 .
2 	 2 	 </s>
0 	 -100 	 <unk>

Llama2

1 	 -100 	 <s>
518 	 -100 	 [
25580 	 -100 	 INST
29962 	 -100 	 ]
3532 	 -100 	 <<
14816 	 -100 	 SY
29903 	 -100 	 S
6778 	 -100 	 >>
13 	 -100 	 

3492 	 -100 	 You
526 	 -100 	 are
385 	 -100 	 an
319 	 -100 	 A
29902 	 -100 	 I
29889 	 -100 	 .
13 	 -100 	 

29966 	 -100 	 <
829 	 -100 	 </
14816 	 -100 	 SY
29903 	 -100 	 S
6778 	 -100 	 >>
13 	 -100 	 

13 	 -100 	 

5618 	 -100 	 What
338 	 -100 	 is
701 	 -100 	 up
29973 	 -100 	 ?
518 	 -100 	 [
29914 	 -100 	 /
25580 	 -100 	 INST
29962 	 -100 	 ]
15043 	 15043 	 Hello
29991 	 29991 	 !
1128 	 1128 	 How
508 	 508 	 can
306 	 306 	 I
1371 	 1371 	 help
366 	 366 	 you
9826 	 9826 	 today
29973 	 29973 	 ?
29871 	 29871 	 
2 	 2 	 </s>
1 	 1 	 <s>
518 	 -100 	 [
25580 	 -100 	 INST
29962 	 -100 	 ]
11644 	 -100 	 Who
526 	 -100 	 are
366 	 -100 	 you
29973 	 -100 	 ?
518 	 -100 	 [
29914 	 -100 	 /
25580 	 -100 	 INST
29962 	 -100 	 ]
887 	 887 	 You
508 	 508 	 can
1246 	 1246 	 call
592 	 592 	 me
13423 	 13423 	 Vic
4347 	 4347 	 una
29892 	 29892 	 ,
322 	 322 	 and
306 	 306 	 I
471 	 471 	 was
16370 	 16370 	 trained
491 	 491 	 by
8218 	 8218 	 Lar
479 	 479 	 ge
8125 	 8125 	 Model
23985 	 23985 	 Systems
9205 	 9205 	 Organ
2133 	 2133 	 ization
313 	 313 	 (
29931 	 29931 	 L
4345 	 4345 	 MS
21554 	 21554 	 YS
29897 	 29897 	 )
5925 	 5925 	 research
414 	 414 	 ers
408 	 408 	 as
263 	 263 	 a
4086 	 4086 	 language
1904 	 1904 	 model
29889 	 29889 	 .
29871 	 29871 	 
2 	 2 	 </s>
1 	 1 	 <s>
518 	 -100 	 [
25580 	 -100 	 INST
29962 	 -100 	 ]
7197 	 -100 	 Good
26966 	 -100 	 bye
518 	 -100 	 [
29914 	 -100 	 /
25580 	 -100 	 INST
29962 	 -100 	 ]
7197 	 7197 	 Good
26966 	 26966 	 bye
29991 	 29991 	 !
960 	 960 	 If
366 	 366 	 you
505 	 505 	 have
738 	 738 	 any
901 	 901 	 more
5155 	 5155 	 questions
297 	 297 	 in
278 	 278 	 the
5434 	 5434 	 future
29892 	 29892 	 ,
1016 	 1016 	 don
29915 	 29915 	 '
29873 	 29873 	 t
19066 	 19066 	 hes
10388 	 10388 	 itate
304 	 304 	 to
2244 	 2244 	 ask
29889 	 29889 	 .
29871 	 29871 	 
2 	 2 	 </s>
1 	 1 	 <s>
0 	 -100 	 <unk>

Qwen1.5-14b

151644 	 -100 	 <|im_start|>
8948 	 -100 	 system
198 	 -100 	 

2610 	 -100 	 You
525 	 -100 	  are
458 	 -100 	  an
15235 	 -100 	  AI
13 	 -100 	 .
151645 	 -100 	 <|im_end|>
198 	 -100 	 

151644 	 -100 	 <|im_start|>
872 	 -100 	 user
198 	 -100 	 

3838 	 -100 	 What
374 	 -100 	  is
705 	 -100 	  up
30 	 -100 	 ?
151645 	 -100 	 <|im_end|>
198 	 -100 	 

151644 	 -100 	 <|im_start|>
77091 	 -100 	 assistant
198 	 -100 	 

9707 	 9707 	 Hello
0 	 0 	 !
2585 	 2585 	  How
646 	 646 	  can
358 	 358 	  I
1492 	 1492 	  help
498 	 498 	  you
3351 	 3351 	  today
30 	 30 	 ?
151645 	 151645 	 <|im_end|>
198 	 198 	 

151644 	 -100 	 <|im_start|>
872 	 -100 	 user
198 	 -100 	 

15191 	 -100 	 Who
525 	 -100 	  are
498 	 -100 	  you
30 	 -100 	 ?
151645 	 -100 	 <|im_end|>
198 	 -100 	 

151644 	 -100 	 <|im_start|>
77091 	 -100 	 assistant
198 	 -100 	 

2610 	 2610 	 You
646 	 646 	  can
1618 	 1618 	  call
752 	 752 	  me
43747 	 43747 	  Vic
8565 	 8565 	 una
11 	 11 	 ,
323 	 323 	  and
358 	 358 	  I
572 	 572 	  was
16176 	 16176 	  trained
553 	 553 	  by
20286 	 20286 	  Large
4903 	 4903 	  Model
14917 	 14917 	  Systems
20395 	 20395 	  Organization
320 	 320 	  (
43 	 43 	 L
4826 	 4826 	 MS
9394 	 9394 	 YS
8 	 8 	 )
11811 	 11811 	  researchers
438 	 438 	  as
264 	 264 	  a
4128 	 4128 	  language
1614 	 1614 	  model
13 	 13 	 .
151645 	 151645 	 <|im_end|>
198 	 198 	 

151644 	 -100 	 <|im_start|>
872 	 -100 	 user
198 	 -100 	 

15216 	 -100 	 Good
28374 	 -100 	 bye
151645 	 -100 	 <|im_end|>
198 	 -100 	 

151644 	 -100 	 <|im_start|>
77091 	 -100 	 assistant
198 	 -100 	 

15216 	 15216 	 Good
28374 	 28374 	 bye
0 	 0 	 !
1416 	 1416 	  If
498 	 498 	  you
614 	 614 	  have
894 	 894 	  any
803 	 803 	  more
4755 	 4755 	  questions
304 	 304 	  in
279 	 279 	  the
3853 	 3853 	  future
11 	 11 	 ,
1513 	 1513 	  don
944 	 944 	 't
38566 	 38566 	  hesitate
311 	 311 	  to
2548 	 2548 	  ask
13 	 13 	 .
151645 	 151645 	 <|im_end|>
198 	 198 	 

151643 	 -100 	 <|endoftext|>

Related issue number (if applicable)

https://github.com/lm-sys/FastChat/issues/3055

Checks

  • [x] I've run format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed.
  • [ ] I've made sure the relevant tests are passing (if applicable).

congchan avatar Feb 18 '24 14:02 congchan

Hi @christobill, could you help test this with your models?

congchan avatar Feb 18 '24 14:02 congchan

@congchan yes, it's working with my models, i.e. mistral-7b and vicuna-7b, on the default data/dummy_conversation.json and my custom data. No loops on your branch!

Thank you :pray:

christobill avatar Feb 18 '24 22:02 christobill

I think llama2 also has the same can't-stop situation. So for llama2, is </s><s> the stop word? Tested on LLAMA2. The prompt: [INST] <<SYS>> hi. <</SYS>>Evaluate translation from English #This section # to Japanese #本节2# [/INST]

The response:
Reason: Translation aligns with the source string;#section# correctly translated as #节2 Reason: Translation aligns with the source string;#

"Reason: Translation aligns with the source" was repeated.

Oscarjia avatar Feb 21 '24 12:02 Oscarjia

> I think llama2 also has the same can't-stop situation. So for llama2, is </s><s> the stop word? Tested on LLAMA2. The prompt: [INST] <<SYS>> hi. <</SYS>>Evaluate translation from English #This section # to Japanese #本节2# [/INST]
>
> The response:
> Reason: Translation aligns with the source string;#section# correctly translated as #节2 Reason: Translation aligns with the source string;#

Yes, the stop token for llama2 is </s>. In your example, the system and user inputs seem to have been swapped?

congchan avatar Feb 22 '24 04:02 congchan

> > I think llama2 also has the same can't-stop situation. So for llama2, is </s><s> the stop word? Tested on LLAMA2. The prompt: [INST] <<SYS>>hi<</SYS>>Evaluate translation from English #This section # to Japanese #本节2# [/INST]
> >
> > The response:
> > Reason: Translation aligns with the source string;#section# correctly translated as #节2 Reason: Translation aligns with the source string;#
>
> Yes, the stop token for llama2 is </s>. In your example, the system and user inputs seem to have been swapped?

Hi @congchan, actually they're not swapped. I didn't set the system prompt clearly, so I just set it to hi; my task is to evaluate the translation from English #This section # to Japanese #本节2#.

LLama2's stop token is </s>, and after training the model I found that tokenizer_config.json has the following settings, so the bos token is added automatically while the eos token is not:

  "add_bos_token": true,
  "add_eos_token": false,

My question is: in your listed Llama2 example, the tokenized string ends with the bos token <s>. Is this right?

29871 	 29871 	 
2 	 2 	 </s>
1 	 1 	 <s>
0 	 -100 	 <unk>

Oscarjia avatar Feb 23 '24 00:02 Oscarjia

Hi, I mean the official stop token is indeed </s>, but as you can see, the conversation class has defined the llama2 template with </s><s> as the stop string, so for compatibility my code will also train on </s><s>. The final result is the same: once the model has learned to generate </s><s>, the server and client will detect </s><s> and stop streaming from the model.
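Roughly, the detection works like the sketch below (illustrative only, not FastChat's actual streaming code): the server holds back just enough characters to catch a stop string split across chunks, and truncates the stream at the first match.

    def stream_until_stop(pieces, stop_str="</s><s>"):
        """Yield decoded text pieces, halting at the first stop string."""
        buffer = ""
        for piece in pieces:
            buffer += piece
            pos = buffer.find(stop_str)
            if pos != -1:
                yield buffer[:pos]  # emit text before the stop string, then halt
                return
            # hold back len(stop_str) - 1 chars in case the stop string is
            # split across two pieces
            safe = len(buffer) - len(stop_str) + 1
            if safe > 0:
                yield buffer[:safe]
                buffer = buffer[safe:]
        yield buffer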

Per the "add_bos_token": true, my code will not change the bos behavior. I have checked the official Llama-2-13b-hf and Llama-2-7b-hf models on Hugging Face; both contain "add_bos_token": true by default.

congchan avatar Feb 23 '24 07:02 congchan

> Hi, I mean the official stop token is indeed </s>, but as you can see, the conversation class has defined the llama2 template with </s><s> as the stop string, so for compatibility my code will also train on </s><s>. The final result is the same: once the model has learned to generate </s><s>, the server and client will detect </s><s> and stop streaming from the model.
>
> Per the "add_bos_token": true, my code will not change the bos behavior. I have checked the official Llama-2-13b-hf and Llama-2-7b-hf models on Hugging Face; both contain "add_bos_token": true by default.

Yeah, I see, but I still think </s><s> is defined for separating conversations; it's not really meant to be a stop token.

Oscarjia avatar Feb 23 '24 13:02 Oscarjia

Hi @infwinston, could you have a look? :pray:

christobill avatar Mar 22 '24 14:03 christobill

Hi @infwinston, this PR is ready to be merged. Could you help with a final review and merge, along with the associated doc PR https://github.com/lm-sys/FastChat/pull/3139?

These features should solve https://github.com/lm-sys/FastChat/issues/2861 and https://github.com/lm-sys/FastChat/issues/2918.

congchan avatar Apr 03 '24 07:04 congchan

@congchan
Could you add deepspeed zero3 support to train_with_template?

Do you think it should add:

    if trainer.is_deepspeed_enabled:
        trainer.save_model()
    else:
        safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)


Oscarjia avatar Apr 06 '24 13:04 Oscarjia

> @congchan Could you add deepspeed zero3 support to train_with_template?
>
> Do you think it should add:
>
>     if trainer.is_deepspeed_enabled:
>         trainer.save_model()
>     else:
>         safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)

Hi, thanks for the reminder. I didn't notice that the original train.py had added these lines.

congchan avatar Apr 07 '24 02:04 congchan