ChengyuBERT
What are the "scope" and "num" columns in the corpus?
Hi, may I ask what the "scope" and "num" columns stand for?
In "idioms_pretrain.json":
idiom     num  explanation
偃武崇文  0    停息武备,崇尚文教。
洪乔捎书  0    指言而无信的人。
南郭先生  103  比喻无才而占据其位的人。
In "idioms_scopes.tsv":
scope      idiom     id
Scope I    见义勇为  0
Scope II   偃武崇文  3848
Scope III  亏于一篑  33237
In "idiom_synonyms.tsv":
query              synonym            query_id  synonym_id  overlapping
黯然销魂           六神无主           14726     1333        0
黯然销魂           丧魂失魄           14726     2704        1
塞翁失马,焉知非福  塞翁失马,安知非福  24524     32175       8
I thought "overlapping" was the number of Chinese characters the two idioms share, but the last row shows 8, where I would expect 7.
Thanks!
- "num" is the frequency of an idiom in the ChengyuCorpus, which is released in Two-stage.
- "scope" is defined as follows. The definition was removed from the camera-ready version because Scope III is not used, so we share it here.

- This is because the comma "," is also counted when computing "overlapping". The last pair shares 7 Chinese characters plus the comma, which gives 8.
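
For anyone who wants to check the numbers, here is a minimal sketch of how the count can come out as 8. It assumes "overlapping" is a multiset intersection of characters, which is consistent with the three rows quoted in the question but is not confirmed to be how the repository computes it. The function name `overlap` and its `ignore` parameter are hypothetical and only serve the illustration.

```python
from collections import Counter


def overlap(query: str, synonym: str, ignore: str = "") -> int:
    """Count characters shared by two idioms.

    Assumption: a multiset intersection of characters; this is an
    illustrative guess, not the repository's confirmed implementation.
    Characters listed in `ignore` are dropped before counting.
    """
    q = Counter(ch for ch in query if ch not in ignore)
    s = Counter(ch for ch in synonym if ch not in ignore)
    # `&` keeps, per character, the smaller of the two counts.
    return sum((q & s).values())


pair = ("塞翁失马,焉知非福", "塞翁失马,安知非福")

# Punctuation counted, as the reply describes.
print(overlap(*pair))               # 8
# Punctuation dropped (both ASCII and full-width commas).
print(overlap(*pair, ignore=",,"))  # 7
```

With the comma counted, the last row matches the value in the file; with it ignored, the result is the 7 expected in the question.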