Unicode character training issue
I tried to train my model to translate English to Bengali. After training, when I run the code, the output is not in Unicode Bengali characters.
I Eat Rice (En) => আমি ভাত খাই (Bn)
This is the kind of data I fed to the model during training. After training completed, I tested the model with the input "I Eat Rice", expecting "আমি ভাত খাই" as output. Instead, the model gave me "Ich esse Reis." I didn't know what language this was; it is not related to Bengali.
I checked the output and it is German. But why is it in German?
import pandas as pd
from sklearn.model_selection import train_test_split
from simplet5 import SimpleT5

# Load the pretrained t5-base checkpoint
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")

# Read the English/Bengali pairs as UTF-8
path = r"D:\Python\Quilbot\Dataset\translation.csv"
df = pd.read_csv(path, encoding="utf8", quotechar="'")

# SimpleT5 expects the columns to be named "source_text" and "target_text"
df = df.rename(columns={"headlines": "source_text", "text": "target_text"})
df = df[["source_text", "target_text"]]

# T5 expects a task prefix; "tn2bn: " is used here for the translation task
df["source_text"] = "tn2bn: " + df["source_text"]
print(df)

train_df, test_df = train_test_split(df, test_size=0.2)
print(train_df.shape, test_df.shape)

model.train(
    train_df=train_df,
    eval_df=test_df,
    source_max_token_len=128,
    target_max_token_len=50,
    batch_size=8,
    max_epochs=3,
    use_gpu=False,
)

# Load the trained checkpoint and run a prediction
model.load_model("t5", "outputs/translate", use_gpu=False)
# Note: this prefix differs from the "tn2bn: " prefix used during training
text_to_translate = "translate: I eat rice."
print(model.predict(text_to_translate))
I have also tested it with the prefix "tn2bn".
@rahat10120141: What does your train_df look like before you feed it to the model?
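One quick way to answer that question is to check whether the target column really contains Bengali script before training. The sketch below is only an illustration, not something from the original post: the helper names `bengali_ratio` and `find_suspect_targets` and the 0.5 threshold are made up here. It relies on the fact that Bengali characters sit in the Unicode block U+0980 to U+09FF, so a target that was read with the wrong encoding, or that is in another language, will score low.

```python
# Bengali letters, vowel signs and digits live in the Unicode block U+0980..U+09FF.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def bengali_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are in the Bengali block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(BENGALI_START <= ord(c) <= BENGALI_END for c in chars)
    return hits / len(chars)


def find_suspect_targets(pairs, min_ratio=0.5):
    """Return the (source, target) pairs whose target is mostly not Bengali script.

    `min_ratio` is an arbitrary threshold chosen for this illustration.
    """
    return [(src, tgt) for src, tgt in pairs if bengali_ratio(tgt) < min_ratio]


if __name__ == "__main__":
    # Hypothetical sample rows; in practice these would come from the CSV.
    sample = [
        ("I eat rice", "আমি ভাত খাই"),
        ("I eat rice", "Ich esse Reis."),
    ]
    for src, tgt in find_suspect_targets(sample):
        print("suspect target:", repr(src), "->", repr(tgt))
```

With the DataFrame from the post, the pairs could be built with `list(zip(df["source_text"], df["target_text"]))`. The same `bengali_ratio` call can be applied to the string returned by the model to confirm whether a prediction is Bengali at all.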
T5 doesn't have English-to-Bengali translation. From the beginning, it was giving me German results.
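A possible direction, offered as a suggestion and not something confirmed in this thread: the original T5 checkpoints were pretrained mostly on English, with supervised translation only into German, French and Romanian, which is consistent with the German output. The multilingual mT5 checkpoints were pretrained on text that includes Bengali, so they may be a better starting point. The sketch below assumes that simplet5 accepts `model_type="mt5"` and that `google/mt5-small` is the checkpoint you want; both values, and the helper names `add_prefix` and `build_model`, are assumptions made for illustration. It also keeps the task prefix in one place so that training and prediction use the same one.

```python
# Assumed values for illustration; they do not come from the original post.
MODEL_TYPE = "mt5"
MODEL_NAME = "google/mt5-small"
TASK_PREFIX = "tn2bn: "  # use the same prefix for training and prediction


def add_prefix(text: str) -> str:
    """Prepend the task prefix to a source sentence."""
    return TASK_PREFIX + text


def build_model():
    """Create a SimpleT5 wrapper around a multilingual checkpoint.

    simplet5 is imported here so that the helpers above can be used
    without the library installed. Calling this downloads the checkpoint.
    """
    from simplet5 import SimpleT5

    model = SimpleT5()
    model.from_pretrained(model_type=MODEL_TYPE, model_name=MODEL_NAME)
    return model


if __name__ == "__main__":
    print(add_prefix("I eat rice."))
```

mT5 was pretrained without any supervised tasks, so the prefix has no built-in meaning and the model still has to be fine-tuned on the English/Bengali pairs before `model.predict(add_prefix("I eat rice."))` can be expected to return Bengali.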