
Generate different representations between Python and Java code

Open Hkh9966 opened this issue 2 years ago • 1 comments

When I run model.generate(**encoding, max_length=128) in my script, only Python code is generated correctly by default, while for Java the model only does completion.

Generated Python code: (image)

Generated Java code: (image)

Here is my script:

import torch
from datetime import datetime
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda"  # for GPU usage or "cpu" for CPU usage
checkpoint = "Salesforce/codet5p-2b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)


while True:
    user_input = input("Input: ")
    if user_input == "exit":
        break

    encoding = tokenizer(user_input, return_tensors="pt").to(device)
    encoding['decoder_input_ids'] = encoding['input_ids'].clone()
    outputs = model.generate(**encoding, max_length=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Am I using it incorrectly? Is there a way to tell the model to generate Java code by default?

Hkh9966 avatar Aug 25 '23 07:08 Hkh9966

Hello, may I know which model you are using? The CodeT5+ 2B/6B/16B models are further fine-tuned on Python code and are more suitable for Python code generation/completion.

yuewang-cuhk avatar Aug 30 '23 09:08 yuewang-cuhk
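
Since the thread reports that Java works in completion mode, one thing that might be worth trying is to start the prompt with a Java comment and an opening method signature, so the model has Java code to continue. The sketch below is only an illustration of that idea and is not a method confirmed by the maintainers: build_java_prompt and complete are hypothetical helper names, and the generation call simply mirrors the script in the question.

```python
def build_java_prompt(description: str, signature: str) -> str:
    """Return a Java-flavoured prefix: line comments followed by an opening method signature.

    Hypothetical helper; the assumption is that a Java prefix nudges the
    model towards continuing in Java rather than switching to Python.
    """
    comment = "\n".join(f"// {line}" for line in description.strip().splitlines())
    return f"{comment}\n{signature.strip()} {{\n"


def complete(prompt: str, checkpoint: str = "Salesforce/codet5p-2b", device: str = "cuda") -> str:
    """Run the same generate() call as the script in the question on the given prompt."""
    # Heavy imports are kept local so the prompt helper works without them.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    encoding["decoder_input_ids"] = encoding["input_ids"].clone()
    outputs = model.generate(**encoding, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, complete(build_java_prompt("Return the sum of two integers.", "public static int add(int a, int b)")) would ask the model to continue the body of that method. Whether the continuation stays in Java is not guaranteed, given that these checkpoints are further fine-tuned on Python.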