inference [Urgent] Mixtral MOE dataset has reference output token length == 0 (or 1 if you count EOS)

There are 4 samples in the reference HF output that has no output other than the EOS.

>>> df = pd.read_pickle("06062024_mixtral_15k_v4.pkl")
>>> df[df['tok_ref_output_len'] == 1]
       dataset            id                                           question                                              input ref_output  ... tok_ref_output stop_sequence tok_stop_sequence tok_input_len tok_ref_output_len
6486  OpenOrca  flan.1662985  You will be given a text below. Complete the t...  <s>[INST] You are an AI assistant. User will y...             ...            [2]          </s>               [2]           146                  1
6489  OpenOrca   flan.403129  Write the last sentence in this story.\n\nFor ...  <s>[INST] You are a helpful assistant, who alw...             ...            [2]          </s>               [2]           192                  1
6630  OpenOrca   flan.778592  How does the sentence end?\n\n(CNN) -- In the ...  <s>[INST] You are an AI assistant. User will y...             ...            [2]          </s>               [2]           267                  1
6733  OpenOrca   flan.387020  What's the most logical way to complete this p...  <s>[INST] You are an AI assistant. You will be...             ...            [2]          </s>               [2]           259                  1

This will cause a bug in the server scenario because when we measure TPOT we use (Total - TTFT) / (#tokens - 1). Even if we count the EOS we will have divide by 0 problem.

Suggested mitigation:

Remove these 4 samples or substitute these with other open_orca rows.
Not removing the 4 samples, but make [2, 2] a legit output (same as receiving [X, 2]. We just need to change TEST06 to allow this.

Given the tightened timeline, I propose we WAR with 2 and fix 1 in the following round. We think the root-cause of this issue is 1) bug with Mixtral-8x7b-instruct model, and 2) the fact that we are using greedy.

Jul 11 '24 21:07 nvzhihanj

@pgmpablo157321 suggested an alternative method, where we increase min_new_token=2 when generating the reference. I did a quick run last weekend, and checked the token length difference between the old run and new run. Although lots of them didn't change, we observe an average of 60 shift in output length shift

new_len = mixtral8x7b_ref_15k_minnew2_df["nv_tok_ref_output_length"].to_numpy()
old_len = mixtral8x7b_ref_15k_minnew2_df["tok_ref_output_len"].to_numpy()
len_diff = np.abs(new_len - old_len)
np.mean(len_diff)
63.06126666666667

This must be relevant to how HF suppresses the EOS and change the token distribution with min_new_token enabled. I don't think we should go that approach.

Jul 15 '24 20:07 nvzhihanj

Nice catch! I checked our both GPU and Gaudi references: exactly the same four samples have issues (flan.1662985, flan.403129, flan.778592, flan.387020).

I guess that prompt format is invalid - I see that there is missing space between <s> and [INST]. Also, new line at the end can help with generating the right output (see https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/blob/main/tokenizer_config.json#L33 and discussion https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/75 ).

I support the solution proposed by Zhihan. We definitely have to introduce small fixes to the dataset before the next round - it's too late to do it now.

Jul 15 '24 21:07 szutenberg

Update: this issue was discussed in the 7/16/24 Inference Working Group meeting, and the mitigation 2 is agreed upon. We will merge #1778 (which allows 2 EOS in the output) and Pablo will update the reference implementation to reflect the change on the SUT side.

Jul 16 '24 16:07 nvzhihanj

WG comments: To revisit in 5.0

Jul 23 '24 16:07 mrmhodak