
What are the "scope" and "num" columns in the corpus?

Open karmalet opened this issue 3 years ago • 1 comments

Hi, may I ask what the "scope" and "num" columns stand for?

In "idioms_pretrain.json",

| idiom | num | explanation |
| --- | --- | --- |
| 偃武崇文 | 0 | 停息武备,崇尚文教。 |
| 洪乔捎书 | 0 | 指言而无信的人。 |
| 南郭先生 | 103 | 比喻无才而占据其位的人。 |

In "idioms_scopes.tsv",

| scope | idiom | id |
| --- | --- | --- |
| Scope I | 见义勇为 | 0 |
| Scope II | 偃武崇文 | 3848 |
| Scope III | 亏于一篑 | 33237 |

In "idiom_synonyms.tsv",

| query | synonym | query_id | synonym_id | overlapping |
| --- | --- | --- | --- | --- |
| 黯然销魂 | 六神无主 | 14726 | 1333 | 0 |
| 黯然销魂 | 丧魂失魄 | 14726 | 2704 | 1 |
| 塞翁失马,焉知非福 | 塞翁失马,安知非福 | 24524 | 32175 | 8 |

I assumed "overlapping" is the number of Chinese characters the two idioms share, but the last row shows 8, where I would expect 7.

Thanks!

karmalet, Feb 06 '22 05:02

  1. "num" is the frequency of the idiom in the ChengyuCorpus, which was released in Two-stage.

  2. "scope" is defined as follows. The definition was removed from the camera-ready version because Scope III is not used, so we share it here. [image: screenshot 企业微信截图_16442221571111]

  3. This is because the comma "," is also counted when computing "overlapping".
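A minimal sketch of how the three "overlapping" values quoted in the question can be reproduced. It assumes the column is a plain count of characters shared by the two strings (a multiset intersection). That rule is a guess that matches the rows above, not the repository's confirmed code. The function name `overlap` and its `ignore` parameter are hypothetical.

```python
from collections import Counter


def overlap(query: str, synonym: str, ignore=frozenset()) -> int:
    """Count characters shared by two idioms (multiset intersection).

    Assumption: this mirrors how "overlapping" was computed; the
    actual script in the repository may differ.
    """
    q = Counter(ch for ch in query if ch not in ignore)
    s = Counter(ch for ch in synonym if ch not in ignore)
    # Counter & Counter keeps the minimum count of each shared character.
    return sum((q & s).values())


# Rows taken from idiom_synonyms.tsv as quoted in the question.
print(overlap("黯然销魂", "六神无主"))
print(overlap("黯然销魂", "丧魂失魄"))
# The full-width comma is counted as a shared character here.
print(overlap("塞翁失马,焉知非福", "塞翁失马,安知非福"))
# Excluding the comma gives the count the question expected.
print(overlap("塞翁失马,焉知非福", "塞翁失马,安知非福", ignore={","}))
```

Under this assumption, passing `ignore={","}` drops the punctuation and yields the count of shared Chinese characters only.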

Vimos, Feb 07 '22 08:02