genomelake icon indicating copy to clipboard operation
genomelake copied to clipboard

Handle ambiguous letters other than N/n, allowed by IUPAC conventions.

Open manyu90 opened this issue 6 years ago • 22 comments

https://www.bioinformatics.org/sms/iupac.html. (N/n and other ambig. symbols must be supported)

manyu90 avatar Apr 23 '18 10:04 manyu90

Maybe it's worth thinking even more general and allow to specify different vocabularies at runtime?

Avsecz avatar Apr 23 '18 18:04 Avsecz

@Avsecz, have you put any additional work into this to be able to process ambiguous nucleotides? I modified basset to code ambiguous bases as a 0.5 in each of the two corresponding nucleotides. I am rewriting this in python/Keras now, and it seems like a change to one_hot_encode_sequence would be needed

GabrielHoffman avatar Sep 12 '18 17:09 GabrielHoffman

@GabrielHoffman I haven't. @jisraeli is it possible to make the one_hot_encode function flexible where one can specify the full dictionary mapping between letters (say "N") and the arrays?

Avsecz avatar Sep 12 '18 17:09 Avsecz

@Avsecz yes. one way to do that is to extend the char2index function in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L18 to map each letter to list of (index, value) pairs and set those in encoded array in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32. what do you think?

jisraeli avatar Sep 12 '18 18:09 jisraeli

I'm not sure that is sufficient since the main thing is the input matrix needs to be allowed to hold floats (allele frequencies) and not just one-hot. Does the current code/data structure require integer/binary values?

On Wed, Sep 12, 2018 at 11:40 AM Johnny Israeli [email protected] wrote:

@Avsecz https://github.com/Avsecz yes. one way to do that is to extend the char2indxe function in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L18 to map each letter to list of (index, value) pairs and set those in encoded array in https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32. what do you think?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kundajelab/genomelake/pull/12#issuecomment-420754968, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI7Ecyv6I1uFTpwhrWrtf8-cHOlI17vks5uaVUzgaJpZM4TfmHD .

akundaje avatar Sep 12 '18 20:09 akundaje

The existing encoded is a float array https://github.com/kundajelab/genomelake/blob/master/genomelake/util.pyx#L32, so when you map letter or allele frequencies to list of (index, value) pairs the value can be a float.

jisraeli avatar Sep 12 '18 20:09 jisraeli

For example, in the current code, the letter N maps to 0.25s in the array.

jisraeli avatar Sep 12 '18 20:09 jisraeli

Ok excellent.

-A

On Wed, Sep 12, 2018 at 1:16 PM Johnny Israeli [email protected] wrote:

For example, in the current code, the letter N maps to 0.25s in the array.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kundajelab/genomelake/pull/12#issuecomment-420782489, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI7EWYh09MFGxlyGgy4cwJIh4L1k6gNks5uaWuUgaJpZM4TfmHD .

akundaje avatar Sep 12 '18 20:09 akundaje

Can the value be a numpy array or a list? For cases where it's like [.5, .5, 0, 0]?

Avsecz avatar Sep 12 '18 20:09 Avsecz

Then the API could basically be to provide a dictionary {"A": np.array([1, 0, 0, 0]), ...} etc and one would need char2index at all. Just hope that dicts are valid in cython and efficiently implemented.

Avsecz avatar Sep 12 '18 20:09 Avsecz

for a case like that, I'd suggest representing it as[(0, 0.5), (1, 0.5)] and then populate the array, if you want to do it in cython. does this make sense?

jisraeli avatar Sep 12 '18 20:09 jisraeli

ie only set the values you need to set.

jisraeli avatar Sep 12 '18 20:09 jisraeli

I could add functionality to use the format:{"A": np.array([1, 0, 0, 0]), ...} R is A or G: {"R": np.array([.5, 0, .5, 0]), ...} and include this in a new version of one_hot_encode_sequence

But I haven't used cython, so I don't understand the syntax [(0, 0.5), (1, 0.5)] It is a sparse-matrix syntax?

GabrielHoffman avatar Sep 12 '18 20:09 GabrielHoffman

@jisraeli what if you need to set multiple values like @GabrielHoffman asked for 'R'?

Avsecz avatar Sep 12 '18 20:09 Avsecz

@GabrielHoffman the idea I suggested is to map each letter/symbol to a list of (index, value) pairs (where the value is non-zero) for that position and then set those in the encoding function that has the array. For the 'R' example above, I'd suggest doing [(0, 0.5), (2, 0.5)] in the symbol mapping function and then set those values in the encoding function. @Avsecz does this make sense? I'm not sure I'm making any sense.

jisraeli avatar Sep 12 '18 20:09 jisraeli

The use case is described in my biorXiv paper: https://www.biorxiv.org/content/early/2018/08/09/389056

I would like to implement the following (text from the methods section):

Personalized genome sequences were constructed using the GRCh37 reference genome with sites modified according to biallelic SNPs in the whole genome sequence using the bcftools consensus command. At homozygous alternate sites the reference allele is simply replaced by the alternative allele. Heterozygous sites are represented using the IUPAC nucleotide ambiguity codes (Comnish-Bowden, 1985), so that for example an A/C heterozygote is indicated with the letter ‘M’. Only biallelic SNPs are considered, so there are 6 ambiguity codes, one for each pair of nucleotides.

Non-ambiguous sites are one-hot coded as a matrix of mostly 0’s with 4 rows corresponding to ‘A’, ‘T’, ‘C’, and ‘G’. Coding a 1 in the ‘T’ row indicates the presence of that nucleotide in the corresponding position in the genome sequence. Ambiguous sites are encoded with a value of 0.5 in the two corresponding rows. Thus the training data does not explicitly include any information about phasing of the SNPs or allele-specific signals.

GabrielHoffman avatar Sep 12 '18 20:09 GabrielHoffman

@jisraeli Yes, I see it now. I will code up and get back to you with any questions.

GabrielHoffman avatar Sep 12 '18 20:09 GabrielHoffman

👍

jisraeli avatar Sep 12 '18 20:09 jisraeli

@GabrielHoffman if you would like to merge what you code up, please include a test case with the new symbols. would love to maintain 90% test coverage :)

jisraeli avatar Sep 12 '18 20:09 jisraeli

map each letter/symbol to a list of (index, value) pairs

@jisraeli. Long day. Now I get it! Somehow over-read the 'list' word. Great idea!:D

Avsecz avatar Sep 12 '18 21:09 Avsecz

👍

jisraeli avatar Sep 12 '18 21:09 jisraeli

I created a new cython function also named one_hot_encode_sequence with the exact arguments as the original. This function defines each character as a 4 entry array A -> [1,0,0,0]

I also defined 3 other functions and saw that this one was the fastest

  • one_hot_encode_sequence_ambig(): python version of this function that create hash table of the letter to arrays

  • one_hot_encode_sequence_v2(): experimental function, used as the basis for final one_hot_encode_sequence()

  • one_hot_encode_sequence_v3(): Following Johnny's suggestion, only encode non-zero elements

See testing code. test_nucleotide_encoding() shows that these functions give equivalent values. test_nucleotide_encoding_runtime() gives timings for each function.

I expected one_hot_encode_sequence_v3() only encode non-zero elements to be the fastest, but I think it is spending too much time re-initializing the same hash and arrays at each function call. One option to avoid this re-initializing is to use a static variable like in C that is only initialized once, but I couldn't figure this out in python/cython. The other option is to put one_hot_encode_sequence() in an object that stores this mapping hash. But that would require changing the function calls to one_hot_encode_sequence() in the rest of the package and dependencies.

Do you have a better way to address this?

GabrielHoffman avatar Sep 24 '18 14:09 GabrielHoffman