smudgeplot icon indicating copy to clipboard operation
smudgeplot copied to clipboard

Paralelization of the kmer search

Open KamilSJaron opened this issue 4 years ago • 6 comments

Hi @KamilSJaron ,

I was wondering how the parallelization you tried was implemented. Because I also saw in a different thread that you tried to create smudgeplots based on heterozygosity of a single position in a k-mer, with limited success, correct?

The details of how smudge_pairs is working are unknown to me, but would it be an idea to give smudge_pairs a commandline option (--pos) only to look for pairs at position n in a k-mer with n =< length(k-mer) ? This way, it would be possible to have length(k-mer) single threaded processes in parallel, querying the same KMC database.

Would love to hear your thoughts on this.

Originally posted by @RNieuwenhuis in https://github.com/KamilSJaron/smudgeplot/issues/64#issuecomment-662963242

KamilSJaron avatar Jul 24 '20 13:07 KamilSJaron

@RNieuwenhuis yep, we did experiment with using only kmers differing in the middle nt. The implementation of the middle position kmer pair search was successful (fast and without too high memory requirements), but the biological outcome was rather dissatisfactory. For plenty of genomes we lost quite a lot of heterozygosity signal at the expense of paralogs. The algorithm is still available with you specify flag "--middle" (look at the corresponding wiki page).

I have not thought of generalizing the algorithm for "any position" to get a parallel version. You are right. I guess that would help a lot with memory requirements too. Hmm, and the middle algorithm is actually really really simple generalizing it would not be a problem at all, actually, it would just take the k_middle variable as an argument. Hmmm.

Do you have big data you would like to run it on?

KamilSJaron avatar Jul 24 '20 14:07 KamilSJaron

@RNieuwenhuis you also wrote

Also, can you indicate in the README and/or wiki what are the differences between the het_kmers and smudge_pairs? I am running smudge_pairs for a month already and the tsv file only has 1m lines an running.

A month is certainly too long. Did you run the KMC or the pythonic version? The pythonic version is much slower compared to KMC.

KamilSJaron avatar Jul 24 '20 14:07 KamilSJaron

Hi @KamilSJaron

Thank you for your reply and making this a new thread.

A month is certainly too long. Did you run the KMC or the pythonic version? The pythonic version is much slower compared to KMC.

I am still running smudge_pairs, if I am not mistaken that is the KMC version of the fork tbenavi1/KMC The data being processed is actually quite a large dataset with a size of 293 Gbp . (Possible hexaploid plant species with genome size of around 10-12 Gbp) However the ploidy level is so far quite uncertain as the Genomescope models fail to capture this intriguing genome. That's why we really need the smudgeplots. Which I think are a very clever idea in the first place!

With regards to your reply on my parallelization idea, that sounds like it indeed would be relatively easy to implement for the python version. If it works, the KMC smudge_pairs version might be omitted. Or would you say that it would also be worth it to implement in the KMC version making it even faster.

RNieuwenhuis avatar Jul 24 '20 15:07 RNieuwenhuis

Ouch, that's a massive genome. I completely see why it might take so long.

To be honest, I am right now a bit buried in other obligations and I won't be able to actively develop smudgeplot for the next few months for sure. However, we would be happy to get more people involved. Would like to try/be willing to write the parallelization and do a pull request?

KamilSJaron avatar Jul 27 '20 09:07 KamilSJaron

I see, I am definitely willing to do a pull request. You prefer to do it on the pythonic version, correct?

I saw that @tbenavi1 also tried the middle middle one away approach in a branch of his KMC fork. Where should I focus on?

RNieuwenhuis avatar Jul 27 '20 11:07 RNieuwenhuis

Excellent!

I think the most reasonable would be to write first a pythonic version. Even if it should be just a prototype for a KMC based version. I think it's reasonable to do a proof of concept first; and also it's very much possible you will figure that the pythonic version is fast enough not to bother with KMC optimization at all. It would be great to have a small test that would show that the two methods (the current algorithm that searchest for kmer pairs and the proposed parallel version) find the same set of kmers.

KamilSJaron avatar Jul 27 '20 11:07 KamilSJaron

Is resolved in the sploidyplot branch

KamilSJaron avatar Aug 17 '23 14:08 KamilSJaron