How can I obtain the vector representations of the compounds and proteins in these two examples?

Open lonngxiang opened this issue 2 years ago • 13 comments

https://github.com/PaddlePaddle/PaddleHelix/blob/dev/tutorials/compound_property_prediction_tutorial_cn.ipynb

https://github.com/PaddlePaddle/PaddleHelix/blob/dev/tutorials/protein_pretrain_and_property_prediction_tutorial_cn.ipynb

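The tutorial code seems to end with only the property predictions for the compound and the protein. It does not explain how to get the intermediate compound and protein representations. How can I obtain them?

The dimensions of the final result do not look like a protein vector representation.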

lonngxiang avatar Mar 13 '22 03:03 lonngxiang

Also, how do I load and use the VAE generative model?

lonngxiang avatar Mar 13 '22 06:03 lonngxiang

Hi lonngxiang, to get the representation of a compound or protein, you can add some code to the corresponding model.py. For the pretrain_gnn model, it should be done here: https://github.com/PaddlePaddle/PaddleHelix/blob/3368b93fc706dd3fea35887748673abcc668c145/apps/pretrained_compound/pretrain_gnns/src/model.py#L56 (After adding it, the main code should change from pred = model(graph) to pred, repr = model(graph); then you can do what you want with the representation.) Let me know if you have any other problems :)

Noisyntrain avatar Mar 13 '22 06:03 Noisyntrain

As for loading the model, you can refer to Paddle's official API for saving and loading models here: https://www.paddlepaddle.org.cn/documentation/docs/zh/faq/save_cn.html#wenti-zengliangxunlianzhong-ruhebaocunmoxinghehuifuxunlian . After you initialize the VAE model, you can load the saved parameters into it.

Noisyntrain avatar Mar 13 '22 06:03 Noisyntrain

Do you mean, for example, returning the output of the MLP layer?

predict_model = "./models/epoch_0.pdparams"
paddle.set_device("gpu")

encoder_model = ProteinEncoderModel(model_config, name='protein')
model = ProteinModel(encoder_model, model_config)
model.load_dict(paddle.load(predict_model))

After this code that loads the trained model, do I then bring in the DownstreamModel class?

lonngxiang avatar Mar 13 '22 07:03 lonngxiang

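  1. In the model, you can change return pred to return pred, graph_repr to get the representation of each molecule.
  2. The order in which DownstreamModel is introduced should not affect the loading of the inner ProteinModel.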
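
The change in point 1 can be illustrated with a toy, pure-Python stand-in. This is only a sketch and not the PaddleHelix model: `ToyGraphModel`, its mean pooling, and its linear head are hypothetical and exist only to show a forward call that returns the prediction together with the intermediate representation.

```python
from typing import List, Tuple


class ToyGraphModel:
    """Hypothetical stand-in for a graph model: pool node features, then apply a linear head."""

    def __init__(self, weights: List[float], bias: float = 0.0) -> None:
        self.weights = weights
        self.bias = bias

    def encode(self, node_feats: List[List[float]]) -> List[float]:
        # Mean-pool node features into one graph-level vector (the "graph_repr").
        n = len(node_feats)
        dim = len(node_feats[0])
        return [sum(node[d] for node in node_feats) / n for d in range(dim)]

    def __call__(self, node_feats: List[List[float]]) -> Tuple[float, List[float]]:
        graph_repr = self.encode(node_feats)
        pred = sum(w * x for w, x in zip(self.weights, graph_repr)) + self.bias
        # Previously this would have been `return pred`; returning both values
        # lets the caller keep the representation as well.
        return pred, graph_repr


model = ToyGraphModel(weights=[1.0, 2.0])
pred, graph_repr = model([[1.0, 0.0], [3.0, 4.0]])
print(pred, graph_repr)
```

Every call site that previously wrote `pred = model(graph)` has to unpack two values after such a change.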

Noisyntrain avatar Mar 13 '22 07:03 Noisyntrain

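I understand the compound case above, where the return is changed to return pred, graph_repr. How should I change the code to get the protein representation?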

lonngxiang avatar Mar 13 '22 07:03 lonngxiang

It would be great to have sample code showing how to load and use a pretrained model.

lonngxiang avatar Mar 13 '22 07:03 lonngxiang

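Hello, the model inference part of the protein tutorial shows how to use the trained model to predict properties for given protein molecules. Here you can take encoder_repr, the output of encoder_model, as the protein representation. The code is as follows: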

# Path to the trained checkpoint
predict_model = "./models/epoch_0.pdparams"
paddle.set_device("gpu")

# Build the encoder, wrap it in the full model, and load the trained weights
encoder_model = ProteinEncoderModel(model_config, name='protein')
model = ProteinModel(encoder_model, model_config)
model.load_dict(paddle.load(predict_model))

tokenizer = ProteinTokenizer()
examples = [
    'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
    'KQHTSRGYLHEFDGDPANRCHQSLYKWHDKDCDWLVDWEMKPMDALMETDHQPSMLVHLEQSYKWFCCIKGKPLNFAALLDGWTKITPMAKALYWRDHISEAWLIQCMFEEKILIVRTLMDENGTHKNYFVMSRLCGSCITFEWDSWEAEKPHKVWMGMKNCVSWKRKDVIEMVFERTQWAKWADNIYNWACCPMQVPEIIPFQFFYQTDENFCFKLLMKPCKFYYFSCHHLGHLHCLLKYQWYKGVYLGMRLRVFHKMIVCFHGHWTWVEGNSGIEGRGGIMMHTGITMDCFFDRNIQQSYGGSRWSEQNMKHSQHSRCDPYRTCEPEGTTPEQKCVQRQRIKVRVCHMPEDCLWTSCV',
]

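# Tokenize each sequence, then pad token ids and positions to the longest sequence
example_ids = [tokenizer.gen_token_ids(example) for example in examples]
max_seq_len = max(len(example_id) for example_id in example_ids)
pos = [list(range(1, len(example_id) + 1)) for example_id in example_ids]
pad_to_max_seq_len(example_ids, max_seq_len)
pad_to_max_seq_len(pos, max_seq_len)

# Switch to eval mode so that dropout, if the encoder uses it, is disabled at inference
encoder_model.eval()

# Run only the encoder; its output has shape [batch, max_seq_len, hidden]
texts = paddle.to_tensor(example_ids)
pos = paddle.to_tensor(pos)
encoder_repr = encoder_model(texts, pos)
print(encoder_repr.shape)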
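
The snippet above calls pad_to_max_seq_len without using a return value, which suggests that it pads the lists in place. Its definition is not shown in this thread, so the following is only a hypothetical stand-in: the name `pad_in_place` and the padding id `0` are assumptions, and the real helper may behave differently.

```python
from typing import List

PAD_ID = 0  # assumed padding id; the real tokenizer may use a different value


def pad_in_place(sequences: List[List[int]], max_len: int, pad_id: int = PAD_ID) -> None:
    """Right-pad every list in `sequences` to `max_len`, modifying the lists in place."""
    for seq in sequences:
        seq.extend([pad_id] * (max_len - len(seq)))


token_ids = [[5, 6, 7], [8, 9]]
# Positions are 1-based and computed before padding, as in the snippet above.
positions = [list(range(1, len(ids) + 1)) for ids in token_ids]
max_len = max(len(ids) for ids in token_ids)
pad_in_place(token_ids, max_len)
pad_in_place(positions, max_len)
print(token_ids)
print(positions)
```

After padding, every row has the same length, so the nested lists can be turned into a rectangular tensor.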

linxd5 avatar Mar 13 '22 08:03 linxd5

Thanks for the reply. The shape I get this way is [2, 362, 1024]. The example has two protein amino-acid sequences, so what does 362 refer to? And which part should be taken as the representation of the whole protein?

Tensor(shape=[2, 362, 1024], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[[ 0.00685551, -0.01426834,  0.        , ...,  0.        ,
           0.04076507, -0.03458853],
         [ 0.01100845, -0.02210978,  0.01090379, ...,  0.00580287,
           0.04014656, -0.03438329],
         [ 0.01268578, -0.02638004,  0.01156768, ...,  0.00714200,
           0.        , -0.03445490],
         ...,
         [ 0.00887288, -0.03410659,  0.00728359, ...,  0.        ,
           0.02557690, -0.03082932],
         [ 0.00829512,  0.        ,  0.00664931, ...,  0.01164682,
           0.01859944, -0.02607050],
         [ 0.00752587, -0.03666925,  0.00552156, ...,  0.00950453,
           0.00943645, -0.01668916]],

        [[ 0.00685121, -0.01426677,  0.00772670, ...,  0.00357180,
           0.04075353, -0.03458196],
         [ 0.01100168, -0.02210602,  0.01090166, ...,  0.00579469,
           0.04013380,  0.        ],
         [ 0.        , -0.02637348,  0.01156313, ...,  0.00713670,
           0.03990369, -0.03444009],
         ...,
         [ 0.00887311, -0.03411532,  0.        , ...,  0.01131222,
           0.02557621, -0.03082581],
         [ 0.00829615, -0.03519142,  0.00665414, ...,  0.        ,
           0.01859915, -0.02606762],
         [ 0.00752727, -0.03667324,  0.00552557, ...,  0.00950468,
           0.00943635, -0.01668769]]])

lonngxiang avatar Mar 13 '22 08:03 lonngxiang

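The example takes 2 proteins as input, and the maximum protein length is 362, so the shape of encoder_repr is [2, 362, 1024]. You can take encoder_repr[0] as the representation of the first protein and encoder_repr[1] as the representation of the second. Alternatively, you can input just 1 protein and use the whole encoder_repr as the protein representation.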
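
Note that when sequences of different lengths are batched, the shorter one's rows also cover padded positions. The following is a minimal sketch, using plain nested lists in place of tensors, of splitting a padded [batch, max_len, hidden] output into one matrix per protein. The helper `per_protein_token_vectors` and the `lengths` list are hypothetical; with real data the lengths would come from the tokenized sequences before padding.

```python
from typing import List

Matrix = List[List[float]]


def per_protein_token_vectors(batch_repr: List[Matrix], lengths: List[int]) -> List[Matrix]:
    """Split a padded [batch, max_len, hidden] output into one [length_i, hidden] matrix per protein."""
    return [protein[:length] for protein, length in zip(batch_repr, lengths)]


# Toy batch: 2 proteins, max_len 3, hidden size 2; the second protein has only 2 real tokens.
batch_repr = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [0.0, 0.0]],  # last row is a padded position (a real model may output non-zero values here)
]
lengths = [3, 2]
trimmed = per_protein_token_vectors(batch_repr, lengths)
print([len(m) for m in trimmed])
```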

linxd5 avatar Mar 13 '22 08:03 linxd5

OK, so 362 means there are vectors for 362 tokens. For the vector representation of the whole protein, should I then take the mean?

lonngxiang avatar Mar 13 '22 08:03 lonngxiang

What do you mean by the whole protein? The input proteins here are independent of one another and unrelated.

linxd5 avatar Mar 13 '22 08:03 linxd5

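I mean something like the CLS sentence vector that represents a whole sentence in a transformer. How is the vector for the whole amino-acid string represented here?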
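
For reference, two common generic ways to reduce per-token vectors to one vector per sequence are masked mean pooling and taking the first token's vector. The thread does not say which of these, if either, is intended for this model, and whether its first token acts like a CLS summary is an assumption. The sketch below uses plain nested lists in place of tensors, and both helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def masked_mean_pool(batch_repr: List[Matrix], lengths: List[int]) -> List[List[float]]:
    """Average token vectors over the real (unpadded) positions of each sequence."""
    pooled = []
    for tokens, length in zip(batch_repr, lengths):
        real = tokens[:length]  # drop padded positions before averaging
        hidden = len(real[0])
        pooled.append([sum(tok[d] for tok in real) / length for d in range(hidden)])
    return pooled


def first_token_pool(batch_repr: List[Matrix]) -> List[List[float]]:
    """Take the vector at position 0 of each sequence, in the style of a CLS token."""
    return [list(tokens[0]) for tokens in batch_repr]


batch_repr = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # third row stands for a padded position
]
lengths = [3, 2]
print(masked_mean_pool(batch_repr, lengths))
print(first_token_pool(batch_repr))
```

Either function maps a [batch, max_len, hidden] input to a [batch, hidden] output, that is, one fixed-size vector per protein.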

lonngxiang avatar Mar 13 '22 08:03 lonngxiang