kgt5
pretrained-Chinese data
Hello, this work is terrific, and I am happy to have found it. I have some questions, though. Could I use your model on Chinese triples data? If so, would I need to train the model again?
Hi, thanks for your interest!
Yes you could use the model but you would have to train it again. Currently only English/Latin characters were used in the pretraining. You would also probably need to use a different tokenizer, which means training from scratch.
Thanks a lot! How could I know the details of your pretrained model? What I want is to run your model on Chinese triples.
You could try the following:
- Convert your KG into verbalized format. This means that for each triple, e.g. (obama, president of, USA), in the train KG, make 2 lines as follows:
  a. "predict tail: obama | president of\tUSA"
  b. "predict head: USA | president of\tobama"
  where '\t' is the tab symbol. Put all of these lines in train.txt (make valid.txt and test.txt similarly), and put all the .txt files in the data/your_dataset_name folder.
- Train using the command provided, with dataset as your_dataset_name
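The verbalization step above can be sketched in a few lines of Python. This is only an illustration of the line format described in this thread, not code from the kgt5 repository: the function names `verbalize` and `write_split` are made up here, and the triples are assumed to already be available as (head, relation, tail) string tuples.

```python
from pathlib import Path


def verbalize(triples):
    """Yield two tab-separated lines per (head, relation, tail) triple.

    Hypothetical helper: one line asks for the tail, the other for the head,
    following the format described in the thread.
    """
    for head, relation, tail in triples:
        yield f"predict tail: {head} | {relation}\t{tail}"
        yield f"predict head: {tail} | {relation}\t{head}"


def write_split(triples, path):
    """Write the verbalized lines for one split (train/valid/test) as UTF-8."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for line in verbalize(triples):
            f.write(line + "\n")


if __name__ == "__main__":
    # Example triple from the thread; replace with your own KG.
    train = [("obama", "president of", "USA")]
    write_split(train, "data/your_dataset_name/train.txt")
```

Writing the files as UTF-8 matters for non-Latin data such as Chinese entity names. The same `write_split` call would be repeated for valid.txt and test.txt.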
How could I know the details of your pretrained model?
What specific details are you looking for that are not there in the paper or on https://huggingface.co/apoorvumang/kgt5-base-wikikg90mv2 ?
Thanks for your answer! I will convert my data to this format.
Actually, I am a rookie. What I want to know is: if I use Chinese triples, I need to train the model again, but what should I change? Do I only replace the data in kgt5/data with my own data and then train your model again?
You would also need to change the tokenizer. The default tokenizer of T5 might not be good enough for Chinese (I'm not sure though).
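One rough way to check whether a tokenizer is "good enough" for a dataset is to measure how many of the token ids it produces are the unknown-token id. The sketch below is an assumption-laden illustration, not part of kgt5: `unk_fraction` and `toy_encode` are hypothetical names, and the toy whitespace tokenizer only stands in for a real one so that the example runs without any downloads.

```python
def unk_fraction(encode, texts, unk_id):
    """Return the fraction of produced token ids that equal unk_id.

    encode: any callable mapping a string to a list of integer token ids.
    """
    total = 0
    unknown = 0
    for text in texts:
        ids = encode(text)
        total += len(ids)
        unknown += sum(1 for token_id in ids if token_id == unk_id)
    return unknown / total if total else 0.0


# Toy stand-in tokenizer (hypothetical): whitespace split, tiny vocabulary,
# every out-of-vocabulary word is mapped to the unknown id.
TOY_VOCAB = {"obama": 3, "usa": 4}
TOY_UNK_ID = 2


def toy_encode(text):
    return [TOY_VOCAB.get(word.lower(), TOY_UNK_ID) for word in text.split()]


if __name__ == "__main__":
    print(unk_fraction(toy_encode, ["obama USA"], TOY_UNK_ID))
    print(unk_fraction(toy_encode, ["奥巴马 总统"], TOY_UNK_ID))
```

With the Hugging Face `transformers` library, the same function could be fed a real tokenizer, for example `tok = AutoTokenizer.from_pretrained("t5-base")`, passing `lambda t: tok(t, add_special_tokens=False)["input_ids"]` as `encode` and `tok.unk_token_id` as `unk_id`. A high unknown fraction on your Chinese entity and relation strings would suggest switching to a tokenizer trained on Chinese text.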
Ok, thanks a lot.
What is the entity_strings.txt file? What is it used for, and how is it mapped to entities? The paper talks about using entity and relation descriptions for training. Will we not use the pkl file of relations for WikiKG90Mv2?
May I ask whether you have trained the model on Chinese triples?