
What's the pad token for deepseek-coder

Open tonyaw opened this issue 1 year ago • 2 comments

Dear experts, I found that there are two pad tokens defined for deepseek-coder. What is the difference between them, and when I need a pad token, which one should I use?

  • tokenizer.json
{
      "id": 32018,
      "content": "<pad>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
  • tokenizer_config.json
  "pad_token": {
    "__type": "AddedToken",
    "content": "<|end▁of▁sentence|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },

Also, why is the second pad token the same as token 32014? I assume this is intentional. Could you please explain the reason?

    {
      "id": 32013,
      "content": "<|begin▁of▁sentence|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    },
    {
      "id": 32014,
      "content": "<|end▁of▁sentence|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": true
    },
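To make the relationship between the two files concrete, here is a minimal sketch of how the effective pad token gets resolved. This is an illustrative simplification, not the actual transformers source: it assumes that an explicit `pad_token` entry in tokenizer_config.json takes precedence, and that the `<pad>` entry (id 32018) in tokenizer.json is otherwise just another vocabulary token.

```python
# Simplified excerpts from the two files quoted above.
tokenizer_json_added_tokens = [
    {"id": 32014, "content": "<|end▁of▁sentence|>", "special": True},
    {"id": 32018, "content": "<pad>", "special": False},
]
tokenizer_config = {
    "pad_token": {"__type": "AddedToken", "content": "<|end▁of▁sentence|>"}
}

def resolve_pad_token(config, added_tokens):
    """tokenizer_config.json wins; otherwise fall back to a <pad> entry."""
    if "pad_token" in config:
        return config["pad_token"]["content"]
    for tok in added_tokens:
        if tok["content"] == "<pad>":
            return tok["content"]
    return None

pad = resolve_pad_token(tokenizer_config, tokenizer_json_added_tokens)
print(pad)  # <|end▁of▁sentence|> -- the token the library actually uses
```

Under this reading, `<pad>` (32018) is effectively unused for padding, and `<|end▁of▁sentence|>` (32014) serves as both EOS and pad.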

tonyaw avatar Jan 02 '24 08:01 tonyaw

Same question. It is not clear why the pad token is the same as the EOS token in the 33b base model (https://huggingface.co/deepseek-ai/deepseek-coder-33b-base/blob/main/tokenizer_config.json) but different from the EOS token in the instruct models: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/tokenizer_config.json

Code snippet from the instruct model:

  "eos_token": {
    "__type": "AddedToken",
    "content": "<|EOT|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "legacy": true,
  "model_max_length": 16384,
  "pad_token": {
    "__type": "AddedToken",
    "content": "<|end▁of▁sentence|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }

netrookiecn avatar Jan 04 '24 05:01 netrookiecn

A related question: when the EOS and PAD tokens are the same, how does the FIM model learn to stop generation? If pad tokens are always masked out of the loss, the model would never learn to predict the EOS token.

zhzhang avatar May 29 '24 18:05 zhzhang
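One common way the concern above is avoided in practice (an assumption about typical training setups, not DeepSeek's actual code) is to mask padding by *position* via the attention mask rather than by token id. The one genuine EOS at the end of each sequence then still contributes to the loss, while the trailing pad copies of the same token do not:

```python
IGNORE_INDEX = -100  # label value ignored by cross-entropy in most frameworks

def build_labels(input_ids, attention_mask):
    """Copy input_ids, ignoring positions the attention mask marks as padding."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

EOS = 32014  # <|end▁of▁sentence|>, doubling as the pad token
# sequence: [code tokens..., real EOS, pad, pad]
input_ids      = [100, 101, 102, EOS, EOS, EOS]
attention_mask = [1,   1,   1,   1,   0,   0]

labels = build_labels(input_ids, attention_mask)
print(labels)  # [100, 101, 102, 32014, -100, -100] -> the real EOS stays supervised
```

Masking by token id instead (`labels[labels == EOS] = -100`) would indeed hide every EOS from the loss, which is exactly the failure mode the question describes.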