PLBART
Missing "java" token in Hugging Face Tokenizer
Hi,
I am trying to replicate the results of PLBART for the code refinement fine-tuning task using Hugging Face. When I tokenize methods that contain the "java" token and then decode them, the "java" token is strangely removed! Here is my code:
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
Also, is there any Hugging Face implementation of the code refinement task using PLBART? My implementation does not achieve the EM and BLEU reported for the test set. I executed the existing fairseq implementation and got EM: 17.67; however, my Hugging Face implementation gets EM: 5.62! What important factors should I check?
Thank you for pointing this out. It is a bug, as we can see here; FAIRSEQ_LANGUAGE_CODES should instead be defined as:
FAIRSEQ_LANGUAGE_CODES = {
"base": ["__java__", "__python__", "__en_XX__"],
"multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}
Otherwise, the regular token java in the vocab will be treated as a special token. Now, in the following:
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
since you do skip_special_tokens=True, the java token is removed.
To verify that tokenization works correctly, we can do:
from transformers import PLBartTokenizer
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
print(tokenizer.tokenize(code))
which outputs:
['▁public', '▁void', '▁METHOD', '_1', '▁(', '▁TYPE', '_1', '▁VAR', '_1', '▁)', '▁throws', 'java', '▁.', 'lang', '.', 'Exception', '▁{', '▁super', '▁.', '▁METHOD', '_1', '▁(', '▁VAR', '_1', '▁)', '▁;', '▁METHOD', '_2', '▁(', '▁VAR', '_1', '▁)', '▁;', '▁}']
And I think the tokenization is fine.
@gchhablani Can you help resolve the bug? FAIRSEQ_LANGUAGE_CODES should be defined as:
FAIRSEQ_LANGUAGE_CODES = {
"base": ["__java__", "__python__", "__en_XX__"],
"multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}
Resolved with this PR (https://github.com/huggingface/transformers/pull/19980).
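For completeness, once a transformers release includes this fix (or when installing from source after the PR), the round trip from the original post should keep the token. A minimal sketch of the expected behavior, assuming the corrected __java__-style codes ship as shown above:

from transformers import PLBartTokenizer

code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
ids = tokenizer(code)["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
# With the fix, "java" is no longer a special token, so the decoded string
# should again contain "java.lang.Exception".
print(decoded)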