kgt5

pretrained-Chinese data

Open Garethyu opened this issue 2 years ago • 10 comments

Hello, this work is terrific, and I am happy to have found it. But I have some questions. Could I use your model for Chinese triples data? If so, should I train your model again?

Garethyu avatar Apr 17 '22 03:04 Garethyu

Hi, thanks for your interest!

Yes you could use the model but you would have to train it again. Currently only English/Latin characters were used in the pretraining. You would also probably need to use a different tokenizer, which means training from scratch.

apoorvumang avatar Apr 17 '22 05:04 apoorvumang

Thanks a lot! How could I know the details of your pretrained model? What I want is to run your model with Chinese triples.

Garethyu avatar Apr 17 '22 08:04 Garethyu

You could try the following:

  1. Convert your KG into verbalized format. This means that for each triple, e.g. (obama, president of, USA), in the train KG, make 2 lines as follows:
     a. "predict tail: obama | president of\tUSA"
     b. "predict head: USA | president of\tobama"

     where '\t' is the tab symbol. Put all of these lines in train.txt (make valid.txt and test.txt similarly). Put all the .txt files in the data/your_dataset_name folder.

  2. Train using the command provided, with dataset set to your_dataset_name.
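
The conversion in step 1 can be sketched in a few lines of Python. This is only an illustration and not code from the kgt5 repository: the helper names `verbalize` and `write_split` are hypothetical, and it assumes entity and relation strings contain no tab or newline characters.

```python
from pathlib import Path


def verbalize(head: str, relation: str, tail: str) -> list[str]:
    """Return the two tab-separated training lines for one triple."""
    return [
        f"predict tail: {head} | {relation}\t{tail}",
        f"predict head: {tail} | {relation}\t{head}",
    ]


def write_split(triples, path) -> int:
    """Write one split (train/valid/test) in verbalized form.

    `triples` is an iterable of (head, relation, tail) string tuples.
    Returns the number of lines written (two per triple).
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [line for h, r, t in triples for line in verbalize(h, r, t)]
    # UTF-8 so that non-Latin entity names (e.g. Chinese) are written intact.
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

For example, `write_split([("obama", "president of", "USA")], "data/your_dataset_name/train.txt")` would write the two lines shown in step 1.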

apoorvumang avatar Apr 17 '22 11:04 apoorvumang

> How could I know the details of your pretrained model?

What specific details are you looking for that are not there in the paper or on https://huggingface.co/apoorvumang/kgt5-base-wikikg90mv2 ?

apoorvumang avatar Apr 17 '22 11:04 apoorvumang

Thanks for your answer! I will convert my data to this format.

Garethyu avatar Apr 17 '22 12:04 Garethyu

> What specific details are you looking for that are not there in the paper or on https://huggingface.co/apoorvumang/kgt5-base-wikikg90mv2 ?

Actually, I am a rookie. What I want to know is: if I use Chinese triples, I need to train it again, but what should I change? Do I only replace kgt5/data with my data and then train your model again?

Garethyu avatar Apr 17 '22 12:04 Garethyu

You would also need to change the tokenizer. The default tokenizer of T5 might not be good enough for Chinese (I'm not sure though).
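
One rough way to check this is to measure how often a tokenizer falls back to its unknown token on your data. The sketch below is an illustration only, not something from the kgt5 repository: `unk_fraction` is a hypothetical helper, and the `t5-small` checkpoint in the commented usage is an assumed stand-in for whichever tokenizer you actually plan to use.

```python
from typing import Callable, Iterable, List


def unk_fraction(
    encode: Callable[[str], List[int]],
    unk_id: int,
    texts: Iterable[str],
) -> float:
    """Fraction of produced token ids that equal the unknown-token id."""
    total = 0
    unknown = 0
    for text in texts:
        ids = encode(text)
        total += len(ids)
        unknown += sum(1 for i in ids if i == unk_id)
    # Avoid division by zero when no tokens were produced at all.
    return unknown / total if total else 0.0


# Possible usage with Hugging Face transformers (needs a model download):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("t5-small")
#   encode = lambda s: tok.encode(s, add_special_tokens=False)
#   print(unk_fraction(encode, tok.unk_token_id, ["your Chinese triple text"]))
```

A value close to 1.0 on Chinese text would suggest that most of the input is lost to the unknown token, in which case a tokenizer trained on Chinese data is probably needed.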

apoorvumang avatar Apr 17 '22 12:04 apoorvumang

Ok, thanks a lot.

Garethyu avatar Apr 17 '22 16:04 Garethyu

What is the entity_strings.txt file? What is it used for, and how is it mapped to entities? The paper talks about using entity and relation descriptions for training. Will we not use the pkl file of relations for WikiKG90Mv2?

ankush9812 avatar May 27 '22 06:05 ankush9812

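> You would also need to change the tokenizer. The default tokenizer of T5 might not be good enough for Chinese (I'm not sure though).
>
> Ok, thanks a lot.

Did you end up training on Chinese triples?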

px6927 avatar Oct 17 '23 12:10 px6927