spaCy-Thai

Can you train on the Thai Treebanks Dataset?

Open wannaphong opened this issue 4 years ago • 7 comments

I found the Thai Treebanks Dataset. The thtb_orchidpp.txt file is a treebank built from the ORCHID corpus, but it is not in CoNLL-U format.

wannaphong avatar Dec 13 '20 18:12 wannaphong

Umm... The dataset seems to be annotated as phrase structure (constituency) trees. For example, the first line

[S [NP [FIXN การ]] [VP [VACT ประชุม] [PP [RPRE ทาง] [NP [NCMN วิชาการ] [PUNC <space>] [NP [NCMN ครั้ง] [DONM ที่ 1]]]]]]

denotes the phrase tree as shown below.

(image: phrase tree)

I trained spaCy-Thai on dependency trees, which are quite different from phrase trees...
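To make the bracket notation above concrete, here is a minimal sketch of how such a line could be read into a nested structure. It assumes every node has the form `[LABEL ...]` and that a leaf holds one or more space-separated word pieces (as in `[DONM ที่ 1]`). The function names `parse_brackets` and `leaves` are made up for this illustration and are not part of spaCy-Thai or the dataset's tooling.

```python
import re

def parse_brackets(text):
    """Parse '[LABEL child ...]' notation into nested tuples.

    A leaf becomes (tag, word); an inner node becomes (label, [subtrees]).
    Assumption: a node contains either sub-nodes or plain words, not both.
    """
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "[", "each node must start with '['"
        pos += 1
        label = tokens[pos]
        pos += 1
        children, words = [], []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                words.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing "]"
        return (label, children) if children else (label, " ".join(words))

    return node()

def leaves(tree):
    """Return the (tag, word) pairs of a parsed tree, left to right."""
    label, body = tree
    if isinstance(body, str):
        return [(label, body)]
    return [leaf for child in body for leaf in leaves(child)]

line = ("[S [NP [FIXN การ]] [VP [VACT ประชุม] [PP [RPRE ทาง] "
        "[NP [NCMN วิชาการ] [PUNC <space>] [NP [NCMN ครั้ง] [DONM ที่ 1]]]]]]")
tree = parse_brackets(line)
for tag, word in leaves(tree):
    print(tag, word)
```

The leaves carry only part-of-speech tags and words; the head and relation columns that a dependency parser needs are not present in this format.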

KoichiYasuoka avatar Dec 14 '20 10:12 KoichiYasuoka

# text = การประชุมทางวิชาการ ครั้งที่ 1
1	การ	_	PART	FIXN	_	0	root	_	SpaceAfter=No
2	ประชุม	_	VERB	VACT	_	1	acl	_	SpaceAfter=No
3	ทาง	_	ADP	RPRE	_	4	case	_	SpaceAfter=No
4	วิชาการ	_	NOUN	NCMN	_	2	obl	_	_
5	ครั้ง	_	NOUN	NCMN	_	1	list	_	SpaceAfter=No
6	ที่	_	DET	PREL	_	7	det	_	_
7	1	_	NUM	DCNM	_	5	nummod	_	SpaceAfter=No

On the other hand, the dependency tree is visualized as:

(image: dependency tree)
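For readers without the image, the CoNLL-U rows above can also be listed as head-to-dependent arcs. The following is a minimal sketch in plain Python, assuming tab-separated columns and no multiword-token or empty-node lines (IDs such as `1-2` or `1.1` would need extra handling). `read_conllu` is a name invented for this example.

```python
CONLLU = """\
# text = การประชุมทางวิชาการ ครั้งที่ 1
1\tการ\t_\tPART\tFIXN\t_\t0\troot\t_\tSpaceAfter=No
2\tประชุม\t_\tVERB\tVACT\t_\t1\tacl\t_\tSpaceAfter=No
3\tทาง\t_\tADP\tRPRE\t_\t4\tcase\t_\tSpaceAfter=No
4\tวิชาการ\t_\tNOUN\tNCMN\t_\t2\tobl\t_\t_
5\tครั้ง\t_\tNOUN\tNCMN\t_\t1\tlist\t_\tSpaceAfter=No
6\tที่\t_\tDET\tPREL\t_\t7\tdet\t_\t_
7\t1\t_\tNUM\tDCNM\t_\t5\tnummod\t_\tSpaceAfter=No
"""

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def read_conllu(block):
    """Turn one CoNLL-U sentence into a list of dicts, skipping comments."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        row["id"], row["head"] = int(row["id"]), int(row["head"])
        rows.append(row)
    return rows

rows = read_conllu(CONLLU)
forms = {r["id"]: r["form"] for r in rows}
forms[0] = "ROOT"  # head 0 marks the root of the sentence
for r in rows:
    print(f'{forms[r["head"]]} -[{r["deprel"]}]-> {r["form"]}')
```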

KoichiYasuoka avatar Dec 14 '20 12:12 KoichiYasuoka

Well, how do we convert the phrase structure and the dependency tree into one another, @wannaphong?

KoichiYasuoka avatar Dec 14 '20 13:12 KoichiYasuoka

> Well, how do we convert the phrase structure and the dependency tree into one another, @wannaphong?

Sorry, I do not know; this is beyond the scope of my expertise. I think @korakot may be able to help with this.

wannaphong avatar Dec 14 '20 15:12 wannaphong

It's possible in theory. A constituency tree can be converted to a dependency tree with no ambiguity. For example:

VP = V + NP can be converted to V -[dobj]-> NP

But there is no ready-made library to do this for Thai (or for many other languages). You may need to convert them one by one.

You can search Google to find some papers and one GitHub repository on this: https://www.google.com/search?q=convert+constituency+tree+to+dependency+tree
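As a rough illustration of the idea (not an existing library, and not a rule set for Thai), a head-rule conversion can be sketched as follows. Each phrase picks one child as its head, and the lexical heads of the other children become dependents of that head. The `HEAD_RULES` and `LABEL_RULES` tables, the function names, and the English toy sentence are all assumptions made up for this example.

```python
# Hypothetical head rules: for each phrase label, child labels to try in order.
HEAD_RULES = {"S": ["VP"], "VP": ["V", "VP"], "NP": ["N", "NP"]}

# Hypothetical relation rules keyed by (phrase, head child, dependent child).
# "dobj" follows the wording in this thread; UD v2 calls this relation "obj".
LABEL_RULES = {("VP", "V", "NP"): "dobj", ("S", "VP", "NP"): "nsubj"}

def pick_head(label, child_labels):
    """Return the position of the head child, falling back to the leftmost."""
    for wanted in HEAD_RULES.get(label, []):
        if wanted in child_labels:
            return child_labels.index(wanted)
    return 0

def to_dependencies(tree):
    """Convert (label, children) / (tag, word) tuples into dependency arcs.

    Returns (words, arcs), where arcs are (head_index, relation, dep_index)
    with 1-based word indices and head 0 for the root.
    """
    words, arcs = [], []

    def visit(node):
        label, body = node
        if isinstance(body, str):      # leaf: (tag, word)
            words.append(body)
            return len(words)          # 1-based index of this word
        heads = [(child[0], visit(child)) for child in body]
        head_pos = pick_head(label, [c for c, _ in heads])
        head_label, head_idx = heads[head_pos]
        for i, (child_label, child_idx) in enumerate(heads):
            if i != head_pos:
                rel = LABEL_RULES.get((label, head_label, child_label), "dep")
                arcs.append((head_idx, rel, child_idx))
        return head_idx                # lexical head of the whole phrase

    arcs.append((0, "root", visit(tree)))
    return words, sorted(arcs, key=lambda arc: arc[2])

toy = ("S", [("NP", [("N", "dogs")]),
             ("VP", [("V", "chase"), ("NP", [("N", "cats")])])])
words, arcs = to_dependencies(toy)
for head, rel, dep in arcs:
    print(head, rel, dep, words[dep - 1])
```

Any pair missing from `LABEL_RULES` falls back to the generic `dep` label, which makes it easy to see which cases still need a rule.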

Korakot

korakot avatar Dec 21 '20 07:12 korakot

> VP = V + NP can be converted to V -[dobj]-> NP

Oh, that looks very nice. But it is unclear to me whether S = NP + VP should be converted into NP <-[nsubj]- VP, NP <-[vocative]- VP, or NP -[acl]-> VP...

KoichiYasuoka avatar Dec 21 '20 08:12 KoichiYasuoka

For S = NP + VP, you need to look inside the NP and the VP to know which [rel] it is. It's not ambiguous, though. You need a few if-then cases on PoS and word groups. Listing all the cases is a bit labor-intensive.
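A sketch of what such if-then cases might look like is given below. The three rules are purely illustrative and not linguistically validated: the FIXN case mirrors the การ / ประชุม example earlier in this thread (where การ is the root and ประชุม attaches as acl), the vocative case relies on a caller-supplied word list, and everything else defaults to nsubj. The function name `relation_for_s` and its parameters are hypothetical.

```python
def relation_for_s(np_head, vp_head, vocatives=frozenset()):
    """Choose the arc for S = NP + VP from the lexical heads of NP and VP.

    np_head and vp_head are (tag, word) pairs.
    Returns (head, dependent, relation).
    """
    np_tag, np_word = np_head
    # Case 1 (assumed): a nominalizing prefix heads the phrase,
    # so the VP modifies it: NP -[acl]-> VP.
    if np_tag == "FIXN":
        return np_head, vp_head, "acl"
    # Case 2 (assumed): the NP is a term of address from a user-supplied list,
    # so it hangs off the verb: NP <-[vocative]- VP.
    if np_word in vocatives:
        return vp_head, np_head, "vocative"
    # Default (assumed): the NP is the subject: NP <-[nsubj]- VP.
    return vp_head, np_head, "nsubj"

print(relation_for_s(("FIXN", "การ"), ("VACT", "ประชุม")))
```

A real rule set would need many more cases, which is the labor-intensive part.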

Korakot

korakot avatar Dec 21 '20 08:12 korakot