DeepGraphGO

About difficult proteins

Open zhanght28 opened this issue 1 year ago • 6 comments

The definition of difficult proteins is: the sequence identity of the most similar (homologous) protein in the training set to a difficult protein is less than 60%. Hello, is the difficult-protein data obtained with max(hsp.identities / rec.query_length for hsp in alignment.hsps) < 0.6? The numbers of difficult proteins I get this way for cc, mf and bp differ from those given in the paper by about 10.

zhanght28 avatar Apr 16 '24 15:04 zhanght28

I'm not sure why. The BLAST output already contains an identity between 0 and 1, and the cutoff is 0.6. I ran BLAST with 1 iteration.

yourh avatar Apr 16 '24 15:04 yourh

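Thank you for the reply. I based my lookup on the test set's psiblast results, xx-test-ppi-blast-out.xml. Do I perhaps need to run psiblast to get the results against the training set?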

zhanght28 avatar Apr 17 '24 07:04 zhanght28

Oh, yes, you need to run it against the training set.

yourh avatar Apr 17 '24 07:04 yourh

Does the identity also need to be computed further from the BLAST results?

zhanght28 avatar Apr 17 '24 09:04 zhanght28

Yes. The BLAST output contains the identity directly, and then you take the maximum over all HSPs.

yourh avatar Apr 17 '24 14:04 yourh
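
The rule described in this reply (take the largest identity over all HSPs against training-set sequences, then apply the 0.6 cutoff) can be sketched roughly as below. This is only an illustration, not the repository's code: the function name `difficult_proteins`, the `hit_identities` mapping, and the decision to treat a test protein with no training-set hit as difficult are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

CUTOFF = 0.6  # identity threshold quoted in this thread


def difficult_proteins(
    test_ids: Iterable[str],
    hit_identities: Dict[str, List[float]],
    cutoff: float = CUTOFF,
) -> List[str]:
    """Return test proteins whose best training-set identity is below cutoff.

    hit_identities maps a test protein ID to identity fractions in [0, 1],
    one per HSP found when searching that protein against the TRAINING set.
    """
    difficult = []
    for pid in test_ids:
        # Assumption: a protein with no hit at all gets identity 0.0,
        # so it is counted as difficult.
        best = max(hit_identities.get(pid, []), default=0.0)
        if best < cutoff:
            difficult.append(pid)
    return difficult


# Hypothetical identities, for illustration only.
example_hits = {"P1": [0.30, 0.75], "P2": [0.59, 0.20]}
print(difficult_proteins(["P1", "P2", "P3"], example_hits))
```

Here `P1` is excluded because its best HSP reaches 0.75, while `P2` (best 0.59) and `P3` (no hits) are kept. Whether no-hit proteins should count is one of the details that could shift the totals by a few proteins.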

I computed it with: max(hsp.identities / rec.query_length for hsp in alignment.hsps). The result from this deviates from the paper, so I'm wondering whether the way I compute it is wrong.

zhanght28 avatar Apr 17 '24 15:04 zhanght28
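
One possible source of such a deviation, offered as an assumption and not a confirmed diagnosis, is the denominator. The percent identity that BLAST itself reports is the number of identical positions divided by the alignment length of the HSP, whereas the expression above divides by the full query length. The sketch below compares the two on made-up numbers. `Hsp` is a stand-in namedtuple carrying the same `identities` and `align_length` fields that Biopython's parsed HSP objects expose, and both function names are hypothetical.

```python
from collections import namedtuple

# Stand-in for a parsed BLAST HSP; only the two fields used here.
Hsp = namedtuple("Hsp", ["identities", "align_length"])


def identity_by_query_length(hsps, query_length):
    """Max over HSPs of identities / query length (the formula in the question)."""
    return max((h.identities / query_length for h in hsps), default=0.0)


def identity_by_align_length(hsps):
    """Max over HSPs of identities / alignment length (BLAST-style identity)."""
    return max((h.identities / h.align_length for h in hsps), default=0.0)


# Made-up HSPs for a hypothetical query of length 200.
hsps = [Hsp(identities=90, align_length=120), Hsp(identities=50, align_length=60)]
query_length = 200

by_query = identity_by_query_length(hsps, query_length)
by_align = identity_by_align_length(hsps)
print(by_query < 0.6, by_align < 0.6)
```

With these numbers the first definition falls below the 0.6 cutoff and the second does not, so the same protein would be labeled difficult under one definition and not under the other. Which definition matches the paper is not stated in this thread.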