subcellular_localization

Can you please publish your preprocessing of the data?

Open chachalin opened this issue 6 years ago • 16 comments

chachalin avatar Nov 21 '18 11:11 chachalin

Could you please publish your preprocessing of the data?

meichangsu1 avatar Jun 14 '19 13:06 meichangsu1

Which part of the preprocessing do you mean? The encoding of the amino acid sequence into BLOSUM62 or profiles?

JJAlmagro avatar Jun 14 '19 13:06 JJAlmagro

yes

meichangsu1 avatar Jun 14 '19 13:06 meichangsu1

Thank you for your response! My problem was solved!

chachalin avatar Jun 14 '19 13:06 chachalin

I also want to know why you use BLOSUM62 or profiles instead of a one-hot encoding of the source protein. Thank you very much.

meichangsu1 avatar Jun 14 '19 13:06 meichangsu1

How did you solve your problem? I am very interested in it.

meichangsu1 avatar Jun 14 '19 13:06 meichangsu1

As you know, by using BLOSUM62 and its profiles. But you can ask the author why he uses them.

chachalin avatar Jun 14 '19 13:06 chachalin

Then may I ask how you obtained the BLOSUM62 encoding and the protein profiles? As a bioinformatics beginner, I only know that BLOSUM62 is used for comparing sequence similarity.

meichangsu1 avatar Jun 14 '19 14:06 meichangsu1

To create the protein profiles you can use PROFILpro (http://download.igb.uci.edu). If that is what you want, I can later add the function that I used to encode the amino acid sequence into a matrix (the input to the neural network).

The disadvantage of one-hot encoding is that it assumes all amino acids are equally different from one another. This is not the case: some amino acids share properties, so substituting an amino acid with one that has similar properties will have a smaller effect on the protein's function or structure. We therefore include this information by encoding the protein using BLOSUM62 or protein profiles, since similar amino acids have similar representations in these matrices.

JJAlmagro avatar Jun 14 '19 14:06 JJAlmagro
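
The point above about one-hot encoding can be illustrated with a minimal sketch. This is not the DeepLoc encoding function, which is not shown in this thread. It assumes a five-residue excerpt of BLOSUM62 (I, L, V, D, E) rather than the full 20x20 matrix, and the names `one_hot_encode`, `blosum_encode`, and `euclidean` are hypothetical helpers introduced only for illustration.

```python
import math

# Excerpt of the BLOSUM62 substitution matrix for five residues only
# (I, L, V are hydrophobic; D, E are acidic). A real encoder would need
# the full 20x20 matrix; this subset is an assumption for illustration.
ALPHABET = "ILVDE"
BLOSUM62_EXCERPT = {
    "I": [4, 2, 3, -3, -3],
    "L": [2, 4, 1, -4, -3],
    "V": [3, 1, 4, -3, -2],
    "D": [-3, -4, -3, 6, 2],
    "E": [-3, -3, -2, 2, 5],
}


def one_hot_encode(sequence):
    """One row per residue, with a single 1 at the residue's index."""
    return [[1 if aa == ref else 0 for ref in ALPHABET] for aa in sequence]


def blosum_encode(sequence):
    """One row per residue: that residue's row of substitution scores."""
    return [list(BLOSUM62_EXCERPT[aa]) for aa in sequence]


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


seq = "ILD"
one_hot = one_hot_encode(seq)
blosum = blosum_encode(seq)

# One-hot: isoleucine is exactly as far from leucine as from aspartate.
print(euclidean(one_hot[0], one_hot[1]), euclidean(one_hot[0], one_hot[2]))

# BLOSUM62 rows: isoleucine is much closer to leucine than to aspartate.
print(euclidean(blosum[0], blosum[1]), euclidean(blosum[0], blosum[2]))
```

With one-hot rows every pair of distinct residues is the same distance apart, whereas the BLOSUM62 rows place the two hydrophobic residues close together and the acidic residue far away. Each encoded sequence is a matrix with one row per residue, which is the general shape of input discussed in this thread.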

Yes, that is exactly what I want. If you can add it, thank you very much!

meichangsu1 avatar Jun 14 '19 14:06 meichangsu1

@JJAlmagro I would also greatly appreciate seeing how protein sequences are encoded into a matrix form to be used as input for the neural network.

murakdar avatar Aug 12 '19 17:08 murakdar

@JJAlmagro I would appreciate it if you could make public the function that you used to convert each protein sequence into a 400x20 matrix. Thanks in advance!

A-Alaa avatar Jan 25 '20 15:01 A-Alaa

@JJAlmagro It is mentioned in the paper under a figure:

proteins shorter than 1000 amino acids are padded from the middle, so the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.

So this approach wasn't used only for visualization, but also during training to unify the protein lengths, whether you choose 1000 or 400 as the global length, right? If so, IMO the preprocessing section of the paper is missing such a statement.

A-Alaa avatar Jan 26 '20 19:01 A-Alaa
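
The rule quoted from the paper (pad short proteins in the middle, remove the middle of long ones) could be sketched as follows. This is not the authors' code, which has not been posted in this thread. The function name `fit_to_length`, the `-` padding character, and the choice to split a sequence at its midpoint are all assumptions; the paper excerpt does not specify them.

```python
def fit_to_length(sequence, target_len=1000, pad_char="-"):
    """Return `sequence` adjusted to exactly `target_len` characters.

    Longer sequences keep their N-terminal and C-terminal ends and lose
    the middle. Shorter sequences get padding inserted in the middle, so
    both termini stay aligned at the ends. The split point (the midpoint,
    with the extra residue going to the N-terminal half) is an assumption.
    """
    n = len(sequence)
    if n >= target_len:
        head = (target_len + 1) // 2   # residues kept from the N-terminus
        tail = target_len - head       # residues kept from the C-terminus
        return sequence[:head] + sequence[n - tail:]
    head = (n + 1) // 2                # split the short sequence in half
    padding = pad_char * (target_len - n)
    return sequence[:head] + padding + sequence[head:]


# Long sequence: the middle residues D..G are removed.
print(fit_to_length("ABCDEFGHIJ", target_len=6))   # ABCHIJ

# Short sequence: padding is inserted between the two halves.
print(fit_to_length("ABCD", target_len=8))         # AB----CD
```

The same function works for a global length of 400 or 1000 by changing `target_len`. In a numeric pipeline the padding positions would typically become all-zero rows of the encoded matrix rather than a literal character, which is again an assumption here.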

Did anyone ever get the function for converting the encoded amino acid sequences into a matrix? @JJAlmagro I am not sure I can easily retrain DeepLoc on novel data without this function. Thank you!

budzakj avatar Oct 29 '21 14:10 budzakj

Have you solved this problem? I need help, too.

BranchW avatar Nov 13 '21 21:11 BranchW

Could you please share the code showing how you crop protein sequences that are longer than 1000? I ask because the longest protein in the DeepLoc dataset has a length of 13100.

yuzhiguo07 avatar Jun 19 '22 22:06 yuzhiguo07