K-BERT

Soft position for Chinese characters

Open STHSF opened this issue 4 years ago • 4 comments

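Hi, reading your article raised a question. In Chinese, "apple" is 苹果 and "China" is 中国, but BERT's input is at the character level. When assigning soft positions, do the two characters of 苹果 (or of 中国) share a single index, or is each character indexed separately? If they are indexed separately, how should the visible matrix be built? Thanks.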

STHSF, Apr 28 '20 06:04

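For example, take "爱吃苹果(水果)哈", where the parenthesized 水果 ("fruit") is the branch attached to 苹果 ("apple"). The soft-position indices of the characters are "0, 1, 2, 3, (4, 5), 4".

In the visible matrix, "果" can see "苹".

For the details, see the code that generates the visible matrix.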

autoliuweijie, May 01 '20 13:05
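The example in the reply above can be traced with a short sketch. This is not the repository's code: the word segmentation (爱 / 吃 / 苹果 / 哈), the `sentence_tree` structure, and the function name are assumptions made for illustration. It assigns one soft position per character, lets the branch continue counting from the word it hangs on, and fills a 0/1 visible matrix with the same rule as the visible-matrix code quoted later in this thread.

```python
import numpy as np

# Hypothetical input: (word, [branch words attached to that word]).
sentence_tree = [("爱", []), ("吃", []), ("苹果", ["水果"]), ("哈", [])]


def build_soft_positions_and_visibility(tree):
    tokens, soft_pos = [], []
    abs_idx_tree, abs_idx_src = [], []
    next_soft = 0  # soft position of the next trunk (sentence) character

    for word, branches in tree:
        # Trunk characters: one token and one soft position each.
        src_ids = []
        for ch in word:
            src_ids.append(len(tokens))
            tokens.append(ch)
            soft_pos.append(next_soft)
            next_soft += 1
        abs_idx_src += src_ids

        # Branch characters continue counting from the end of the word,
        # but do not advance the trunk counter.
        ent_ids = []
        for branch in branches:
            ids = []
            for offset, ch in enumerate(branch):
                ids.append(len(tokens))
                tokens.append(ch)
                soft_pos.append(next_soft + offset)
            ent_ids.append(ids)
        abs_idx_tree.append((src_ids, ent_ids))

    # 1 = visible, 0 = not visible.
    n = len(tokens)
    visible = np.zeros((n, n), dtype=int)
    for src_ids, ent_ids in abs_idx_tree:
        for i in src_ids:
            # A trunk token sees the whole trunk plus its own branches.
            visible[i, abs_idx_src + [j for ent in ent_ids for j in ent]] = 1
        for ent in ent_ids:
            for i in ent:
                # A branch token sees its own branch and the word it hangs on.
                visible[i, ent + src_ids] = 1
    return tokens, soft_pos, visible


tokens, soft_pos, visible = build_soft_positions_and_visibility(sentence_tree)
print(tokens)    # 爱 吃 苹 果 水 果 哈
print(soft_pos)  # [0, 1, 2, 3, 4, 5, 4]
print(visible)
```

Under these assumptions "果" (index 3) sees "苹" (index 2), while "哈" (index 6) and the branch characters "水果" (indices 4 and 5) do not see each other, even though "哈" and "水" share soft position 4.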

Have you considered doing soft position at the word level instead? Would word-level soft position work better?

WenTingTseng, May 02 '20 16:05

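> Have you considered doing soft position at the word level instead? Would word-level soft position work better?

With word-level soft position, "蜜蜂" (bee) and "蜂蜜" (honey) could no longer be told apart.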

autoliuweijie, May 03 '20 06:05
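One way to read the reply above: if every character in a word received the same position index, the two words would reduce to the same set of (character, position) pairs. The sketch below illustrates that reading only; the helper names are hypothetical and nothing here comes from the repository.

```python
def char_level_positions(words):
    """One position per character, counted across the whole sequence."""
    pairs, pos = [], 0
    for word in words:
        for ch in word:
            pairs.append((ch, pos))
            pos += 1
    return pairs


def word_level_positions(words):
    """One position per word, shared by all of its characters."""
    return [(ch, pos) for pos, word in enumerate(words) for ch in word]


bee, honey = ["蜜蜂"], ["蜂蜜"]

# Compare the inputs as unordered collections of (character, position) pairs,
# since position indices are the only order signal attention receives.
char_level_same = sorted(char_level_positions(bee)) == sorted(char_level_positions(honey))
word_level_same = sorted(word_level_positions(bee)) == sorted(word_level_positions(honey))

print(char_level_same)  # False: the characters carry different positions
print(word_level_same)  # True: both words collapse to the same pairs
```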

Hello, how are you enforcing the -inf condition when two words are not in the same branch? In the code you set both places to 1, but shouldn't the values be 0 and -inf?

        # Calculate visible matrix (1 = visible, 0 = not visible)
        visible_matrix = np.zeros((token_num, token_num))
        for item in abs_idx_tree:
            src_ids = item[0]
            # Sentence tokens see all sentence tokens plus their own branches.
            for id in src_ids:
                visible_abs_idx = abs_idx_src + [idx for ent in item[1] for idx in ent]
                visible_matrix[id, visible_abs_idx] = 1
            # Branch tokens see only their own branch and the tokens it hangs on.
            for ent in item[1]:
                for id in ent:
                    visible_abs_idx = ent + src_ids
                    visible_matrix[id, visible_abs_idx] = 1

swarnadeep8597, Nov 13 '22 10:11
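The thread does not answer this question. One common pattern, offered here as an assumption about how a 0/1 visible matrix could be consumed and not as a quote from the repository, is to keep the matrix as 1 (visible) and 0 (not visible) at construction time and convert it to an additive mask just before the softmax: visible entries add 0 to the attention score and invisible entries add a large negative number that stands in for -inf. The function names and the constant `-10000.0` below are illustrative choices.

```python
import numpy as np


def to_additive_mask(visible_matrix, neg=-10000.0):
    """Map 1 -> 0.0 (keep the score) and 0 -> neg (suppress the score)."""
    return (1.0 - visible_matrix) * neg


def masked_softmax(scores, visible_matrix):
    """Row-wise softmax over attention scores after adding the mask."""
    logits = scores + to_additive_mask(visible_matrix)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy case: tokens 0 and 1 see each other, token 2 sees only itself.
visible = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1]], dtype=float)
scores = np.zeros((3, 3))  # uniform raw scores, so only the mask matters

attn = masked_softmax(scores, visible)
print(attn)
```

With uniform scores, each row spreads its weight evenly over the visible tokens and gives the masked tokens a weight of effectively zero, which is the behaviour the "0 and -inf" formulation describes.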