genomelake
genomelake copied to clipboard
Handle ambiguous letters other than N/n, allowed by IUPAC conventions.
https://www.bioinformatics.org/sms/iupac.html. (N/n and other ambig. symbols must be supported)
Maybe it's worth thinking even more general and allow to specify different vocabularies at runtime?
@Avsecz, have you put any additional work into this to be able to process ambiguous nucleotides? I modified basset to code ambiguous bases as a 0.5 in each of the two corresponding nucleotides. I am rewriting this in python/Keras now, and it seems like a change to one_hot_encode_sequence would be needed
@GabrielHoffman I haven't. @jisraeli is it possible to make the one_hot_encode function flexible where one can specify the full dictionary mapping between letters (say "N") and the arrays?
@Avsecz yes. one way to do that is to extend the char2index function in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L18 to map each letter to list of (index, value)
pairs and set those in encoded
array in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32. what do you think?
I'm not sure that is sufficient since the main thing is the input matrix needs to be allowed to hold floats (allele frequencies) and not just one-hot. Does the current code/data structure require integer/binary values?
On Wed, Sep 12, 2018 at 11:40 AM Johnny Israeli [email protected] wrote:
@Avsecz https://github.com/Avsecz yes. one way to do that is to extend the char2indxe function in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L18 to map each letter to list of (index, value) pairs and set those in encoded array in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32. what do you think?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kundajelab/genomelake/pull/12#issuecomment-420754968, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI7Ecyv6I1uFTpwhrWrtf8-cHOlI17vks5uaVUzgaJpZM4TfmHD .
The existing encoded
is a float array https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32, so when you map letter or allele frequencies to list of (index, value)
pairs the value
can be a float.
For example, in the current code, the letter N maps to 0.25s in the array.
Ok excellent.
-A
On Wed, Sep 12, 2018 at 1:16 PM Johnny Israeli [email protected] wrote:
For example, in the current code, the letter N maps to 0.25s in the array.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kundajelab/genomelake/pull/12#issuecomment-420782489, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI7EWYh09MFGxlyGgy4cwJIh4L1k6gNks5uaWuUgaJpZM4TfmHD .
Can the value be a numpy array or a list? For cases where it's like [.5, .5, 0, 0]?
Then the API could basically be to provide a dictionary {"A": np.array([1, 0, 0, 0]), ...}
etc and one would need char2index
at all. Just hope that dicts are valid in cython and efficiently implemented.
for a case like that, I'd suggest representing it as[(0, 0.5), (1, 0.5)]
and then populate the array, if you want to do it in cython. does this make sense?
ie only set the values you need to set.
I could add functionality to use the format:{"A": np.array([1, 0, 0, 0]), ...}
R is A or G: {"R": np.array([.5, 0, .5, 0]), ...}
and include this in a new version of one_hot_encode_sequence
But I haven't used cython, so I don't understand the syntax [(0, 0.5), (1, 0.5)]
It is a sparse-matrix syntax?
@jisraeli what if you need to set multiple values like @GabrielHoffman asked for 'R'?
@GabrielHoffman the idea I suggested is to map each letter/symbol to a list of (index, value)
pairs (where the value is non-zero) for that position and then set those in the encoding function that has the array. For the 'R' example above, I'd suggest doing [(0, 0.5), (2, 0.5)]
in the symbol mapping function and then set those values in the encoding function. @Avsecz does this make sense? I'm not sure I'm making any sense.
The use case is described in my biorXiv paper: https://www.biorxiv.org/content/early/2018/08/09/389056
I would like to implement the following (text from the methods section):
Personalized genome sequences were constructed using the GRCh37 reference genome with sites modified according to biallelic SNPs in the whole genome sequence using the bcftools consensus command. At homozygous alternate sites the reference allele is simply replaced by the alternative allele. Heterozygous sites are represented using the IUPAC nucleotide ambiguity codes (Comnish-Bowden, 1985), so that for example an A/C heterozygote is indicated with the letter ‘M’. Only biallelic SNPs are considered, so there are 6 ambiguity codes, one for each pair of nucleotides.
Non-ambiguous sites are one-hot coded as a matrix of mostly 0’s with 4 rows corresponding to ‘A’, ‘T’, ‘C’, and ‘G’. Coding a 1 in the ‘T’ row indicates the presence of that nucleotide in the corresponding position in the genome sequence. Ambiguous sites are encoded with a value of 0.5 in the two corresponding rows. Thus the training data does not explicitly include any information about phasing of the SNPs or allele-specific signals.
@jisraeli Yes, I see it now. I will code up and get back to you with any questions.
👍
@GabrielHoffman if you would like to merge what you code up, please include a test case with the new symbols. would love to maintain 90% test coverage :)
map each letter/symbol to a list of (index, value) pairs
@jisraeli. Long day. Now I get it! Somehow over-read the 'list' word. Great idea!:D
👍
I created a new cython function also named one_hot_encode_sequence
with the exact arguments as the original. This function defines each character as a 4 entry array A -> [1,0,0,0]
I also defined 3 other functions and saw that this one was the fastest
-
one_hot_encode_sequence_ambig()
: python version of this function that create hash table of the letter to arrays -
one_hot_encode_sequence_v2()
: experimental function, used as the basis for finalone_hot_encode_sequence()
-
one_hot_encode_sequence_v3()
: Following Johnny's suggestion, only encode non-zero elements
See testing code. test_nucleotide_encoding()
shows that these functions give equivalent values. test_nucleotide_encoding_runtime()
gives timings for each function.
I expected one_hot_encode_sequence_v3()
only encode non-zero elements to be the fastest, but I think it is spending too much time re-initializing the same hash and arrays at each function call. One option to avoid this re-initializing is to use a static variable like in C that is only initialized once, but I couldn't figure this out in python/cython. The other option is to put one_hot_encode_sequence()
in an object that stores this mapping hash. But that would require changing the function calls to one_hot_encode_sequence()
in the rest of the package and dependencies.
Do you have a better way to address this?