subcellular_localization: Can you please publish your preprocessing of the data?

Nov 21 '18 11:11 chachalin

Could you please publish your preprocessing of the data?

Jun 14 '19 13:06 meichangsu1

Which part of the preprocessing are you referring to? Do you mean the encoding of the amino acid sequence with BLOSUM62 or profiles?

Jun 14 '19 13:06 JJAlmagro

yes

Jun 14 '19 13:06 meichangsu1

Thank you for your response! My problem is solved!

Jun 14 '19 13:06 chachalin

I would also like to know why you use BLOSUM62 or profiles instead of a one-hot encoding of the source protein. Thank you very much.

Jun 14 '19 13:06 meichangsu1

How did you solve your problem? I am very interested in it.

Jun 14 '19 13:06 meichangsu1

As you know, by using BLOSUM62 and the profiles. As for why they are used, you can ask the author.

Jun 14 '19 13:06 chachalin

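Then may I ask how you obtained the BLOSUM62 encoding and the protein profiles? As a bioinformatics beginner, I only know that BLOSUM62 is used for comparing sequence similarity.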

Jun 14 '19 14:06 meichangsu1

To create the protein profiles you can use PROFILpro (http://download.igb.uci.edu). I can later add the function that I used to encode the amino acid sequence into a matrix (the input to the neural network), if that is what you want.

The disadvantage of one-hot encoding is that it assumes all amino acids are equally different from each other. This is not the case: some amino acids share properties, so substituting an amino acid with one that has similar properties has a smaller effect on protein function or structure. We therefore include this information by encoding the protein with BLOSUM62 or protein profiles, in which similar amino acids have similar representations.
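
To illustrate the point about one-hot versus BLOSUM62, here is a minimal sketch. It is not the encoding function used in DeepLoc. It assumes a five-residue excerpt of BLOSUM62 (I, L, V, D, E) in place of the full 20x20 matrix, and the helper names `one_hot_encode`, `blosum_encode` and `distance` are hypothetical. Each residue becomes one row, so a sequence of length n becomes an n-row matrix.

```python
import math

# Excerpt of the BLOSUM62 matrix for five amino acids only (assumption for
# brevity); a real encoder would use the full 20x20 matrix.
ALPHABET = "ILVDE"
BLOSUM62_EXCERPT = {
    "I": [4, 2, 3, -3, -3],
    "L": [2, 4, 1, -4, -3],
    "V": [3, 1, 4, -3, -2],
    "D": [-3, -4, -3, 6, 2],
    "E": [-3, -3, -2, 2, 5],
}


def one_hot_encode(seq):
    """Encode each residue as a 0/1 indicator vector over ALPHABET."""
    return [[1 if aa == letter else 0 for letter in ALPHABET] for aa in seq]


def blosum_encode(seq):
    """Encode each residue as its row of the substitution matrix."""
    return [list(BLOSUM62_EXCERPT[aa]) for aa in seq]


def distance(u, v):
    """Euclidean distance between two residue vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


one_hot = one_hot_encode("ILD")
blosum = blosum_encode("ILD")

# One-hot: isoleucine is as far from leucine as from aspartate.
d_onehot_IL = distance(one_hot[0], one_hot[1])
d_onehot_ID = distance(one_hot[0], one_hot[2])

# BLOSUM62 rows: isoleucine is much closer to leucine than to aspartate.
d_blosum_IL = distance(blosum[0], blosum[1])
d_blosum_ID = distance(blosum[0], blosum[2])
```

With one-hot vectors every pair of distinct residues is the same distance apart, while the BLOSUM62 rows place the two hydrophobic residues (I, L) closer to each other than to the acidic one (D).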

Jun 14 '19 14:06 JJAlmagro

Yes, that is just what I want. If you can add it, thank you very much!

Jun 14 '19 14:06 meichangsu1

@JJAlmagro I would also greatly appreciate seeing how protein sequences are encoded into a matrix form to be used as input for the neural network.

Aug 12 '19 17:08 murakdar

@JJAlmagro I would appreciate it if you could make public the function that you used to convert each protein sequence into a 400x20 matrix. Thanks in advance!

Jan 25 '20 15:01 A-Alaa

@JJAlmagro It is mentioned in the paper under a figure:

proteins shorter than 1000 amino acids are padded from the middle, so the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.

So this approach was used not only for visualization but also during training, to unify the protein lengths, whether you choose 1000 or 400 as the global length, right? If so, IMO the preprocessing section of the paper is missing such a statement.
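
For reference, the scheme in that caption could be sketched as follows. This is only my reading of the caption, not the DeepLoc code: the function name `unify_length` is hypothetical, and the even split between the N-terminal and C-terminal halves is an assumption.

```python
def unify_length(rows, target_len, pad_row):
    """Return exactly target_len rows, keeping both termini aligned.

    rows       -- list with one entry (e.g. an encoded vector) per residue
    target_len -- global length, e.g. 1000 or 400
    pad_row    -- value inserted as padding (the same object is reused)
    """
    n = len(rows)
    if n >= target_len:
        # Too long: keep the N-terminal and C-terminal ends, drop the middle.
        head = (target_len + 1) // 2
        tail = target_len - head
        return rows[:head] + rows[n - tail:]
    # Too short: split in the middle and insert padding between the halves.
    head = (n + 1) // 2
    padding = [pad_row] * (target_len - n)
    return rows[:head] + padding + rows[head:]


# Letters stand in for encoded residue vectors to make the effect visible.
cropped = "".join(unify_length(list("ABCDEFGHIJ"), 6, "-"))  # "ABCHIJ"
padded = "".join(unify_length(list("ABCD"), 8, "-"))         # "AB----CD"
```

In both cases the first and last residues stay at the first and last positions of the output.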

Jan 26 '20 19:01 A-Alaa

Did anyone ever get the function for converting the amino acid sequences into the encoded matrix? @JJAlmagro I am not sure I can easily retrain DeepLoc on novel data without this function. Thank you!

Oct 29 '21 14:10 budzakj

Have you solved this problem? I need help, too.

Nov 13 '21 21:11 BranchW

Could you please share the code showing how you crop protein sequences that are longer than 1000 amino acids? The longest protein in the DeepLoc dataset is 13100 amino acids long.

Jun 19 '22 22:06 yuzhiguo07