subcellular_localization
Can you please publish your preprocessing of the data
Could you please publish your preprocessing of the data?
Which part of the preprocessing are you referring to? Do you mean the encoding of the amino acid sequence into BLOSUM62 or profiles?
yes
Thank you for your response! My problem is solved.
I would also like to know why BLOSUM62 or profiles are used instead of a one-hot encoding of the source protein. Thank you very much.
How did you solve your problem? I am very interested.
As you know, by using BLOSUM62 and its profiles. You can ask the author why he chose to use them.
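Then may I ask how you obtained the BLOSUM62 encoding and the protein profiles? I am new to bioinformatics and only know that BLOSUM62 is used for comparing sequence similarity.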
To create the protein profiles you can use PROFILpro (http://download.igb.uci.edu). I can later add the function that I used to encode the amino acid sequence into a matrix (the input to the neural network), if that is what you want.
The disadvantage of one-hot encoding is that it assumes all amino acids are equally different from each other. This is not the case: some amino acids share properties, so substituting an amino acid with one that has similar properties will have a smaller effect on the protein's function or structure. We include this information by encoding the protein using BLOSUM62 or protein profiles, because similar amino acids have similar representations in these matrices.
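To make the one-hot versus BLOSUM62 point concrete, here is a minimal sketch. It is not the function used for DeepLoc, which has not been posted in this thread. It uses only a 4x4 excerpt of BLOSUM62 (I, L, D, E) so that it stays self-contained; the helper names `one_hot`, `blosum_row`, `encode` and `euclidean` are hypothetical. A real encoder would use the full 20x20 matrix.

```python
import math

# Illustration only: four amino acids instead of all twenty.
AMINO_ACIDS = ["I", "L", "D", "E"]

# Excerpt of the standard BLOSUM62 matrix, columns in AMINO_ACIDS order.
BLOSUM62_EXCERPT = {
    "I": [4, 2, -3, -3],
    "L": [2, 4, -4, -3],
    "D": [-3, -4, 6, 2],
    "E": [-3, -3, 2, 5],
}


def one_hot(residue):
    """Return a vector with a single 1 at the residue's index."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def blosum_row(residue):
    """Return the residue's row of substitution scores."""
    return list(BLOSUM62_EXCERPT[residue])


def encode(sequence, residue_encoder):
    """Encode a sequence as a (length x features) matrix, one row per residue."""
    return [residue_encoder(residue) for residue in sequence]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # One-hot: every pair of distinct residues is equally far apart.
    print(euclidean(one_hot("I"), one_hot("L")), euclidean(one_hot("I"), one_hot("D")))
    # BLOSUM62 rows: I is closer to L (both hydrophobic) than to D (acidic).
    print(euclidean(blosum_row("I"), blosum_row("L")), euclidean(blosum_row("I"), blosum_row("D")))
    print(encode("ILDE", blosum_row))
```

With one-hot vectors the network has to learn from scratch that isoleucine and leucine behave alike; with BLOSUM62 rows that similarity is already present in the input.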
Yes, that is exactly what I want. If you can add it, thank you very much!
@JJAlmagro I would also greatly appreciate seeing how protein sequences are encoded into a matrix form to be used as input for the neural network.
@JJAlmagro I would appreciate it if you could make public the function that you used to convert each protein sequence into a 400x20 matrix. Thanks in advance!
@JJAlmagro It is mentioned in the paper under a figure:
proteins shorter than 1000 amino acids are padded from the middle, so the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.
So this approach was used not only for visualization but also during training, to bring all proteins to a uniform length, whether you choose 1000 or 400 as the global length, right? If so, in my opinion the preprocessing section of the paper is missing a statement to that effect.
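For anyone who wants to try the scheme from the figure caption, here is a minimal sketch of padding from the middle and removing the middle. It is a guess at the behaviour, not the author's code. The function name `pad_or_crop_middle`, the zero padding value and the half-and-half split point are all assumptions; the caption does not say exactly where the split is made.

```python
def pad_or_crop_middle(rows, target_len, n_features):
    """Return exactly target_len rows, keeping both termini in place.

    rows is a list of per-residue feature vectors (one list per residue).
    Assumption: the split is made at the midpoint; the paper's caption
    does not state the exact position.
    """
    n = len(rows)
    if n >= target_len:
        # Long protein: keep the N-terminal half and the C-terminal half,
        # dropping the residues in between.
        head = target_len // 2
        tail = target_len - head
        return rows[:head] + rows[n - tail:]
    # Short protein: insert all-zero rows at the sequence midpoint so the
    # N-terminus stays at the start and the C-terminus at the end.
    head = n // 2
    padding = [[0.0] * n_features for _ in range(target_len - n)]
    return rows[:head] + padding + rows[head:]


if __name__ == "__main__":
    short = [[float(i)] for i in range(1, 6)]   # 5 residues, 1 feature each
    long_ = [[float(i)] for i in range(1, 11)]  # 10 residues, 1 feature each
    print(pad_or_crop_middle(short, 8, 1))
    print(pad_or_crop_middle(long_, 4, 1))
```

The same function would cover either global length mentioned in this thread (1000 or 400) by changing `target_len`. If zero padding is used during training, the padded positions would normally also need a mask.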
Did anyone ever get the function for converting the encoded amino acid sequences into a matrix? @JJAlmagro I am not sure I can easily retrain DeepLoc on novel data without this function. Thank you!
Have you solved this problem? I need help, too.
Could you please share the code showing how you crop protein sequences that are longer than 1000 amino acids? The longest protein in the DeepLoc dataset is 13100 amino acids long.