GFAffix icon indicating copy to clipboard operation
GFAffix copied to clipboard

speeding up deep graphs

Open ekg opened this issue 3 years ago • 20 comments

On deep graphs (2k-fold) I'm seeing gfaffix taking quite a bit of time. It's essentially single threaded, right? Is there a possible way to adapt it to operate in parallel?

ekg avatar May 17 '22 10:05 ekg

There is potential for speedup by multithreading parts of GFAffix, especially if the bottle neck is the graph traversal. I'd be interested in knowing which step in the algorithm affects the running time on these deep graphs. The graph traversal can be parallelized, but it's not embarrassingly parallelizable, as the graph editing (which is intertwined in the graph traversal) can only be done safely in a single thread.

danydoerr avatar May 18 '22 12:05 danydoerr

If you give me some sample dataset I can try to perform runtime analysis and try to improve running time by parallelized or not.

natir avatar May 19 '22 12:05 natir

@ekg do you have such a "deep graph" handy?

danydoerr avatar May 19 '22 12:05 danydoerr

Or just smaller but similar graph.

natir avatar May 19 '22 12:05 natir

Yes, but they are big. I don't think you need to use a deep graph to see if parallelism can help. Probably a chunk of the MHC will be enough to profile, or any graph that you've already used for testing that takes e.g. 5-10s to process.

On Thu, May 19, 2022, 14:42 Daniel Doerr @.***> wrote:

@ekg https://github.com/ekg do you have such a "deep graph" handy?

— Reply to this email directly, view it on GitHub https://github.com/marschall-lab/GFAffix/issues/2#issuecomment-1131637189, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEM2VS4UROFRWBP62ILVKYZKHANCNFSM5WEPZ5WQ . You are receiving this because you were mentioned.Message ID: @.***>

ekg avatar May 19 '22 12:05 ekg

Do these deep graphs have a particular high node degree, or is the number of nodes exceptionally high?

danydoerr avatar May 19 '22 13:05 danydoerr

High node degree and path depth. 2k fold coverage

On Thu, May 19, 2022, 15:07 Daniel Doerr @.***> wrote:

Do these deep graphs have a particular high node degree, or is the number of nodes exceptionally high?

— Reply to this email directly, view it on GitHub https://github.com/marschall-lab/GFAffix/issues/2#issuecomment-1131666623, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEKGUXR4IN3CXK47MALVKY4KRANCNFSM5WEPZ5WQ . You are receiving this because you were mentioned.Message ID: @.***>

ekg avatar May 19 '22 13:05 ekg

GFAffix should scale linearly with path depth, overall this shouldn't dominate the runtime. But node degree is certainly a bottle neck.

danydoerr avatar May 19 '22 13:05 danydoerr

@ekg Could you give me advise how to build a similar graph, type of data, pipeline, tools, etc…

natir avatar May 19 '22 20:05 natir

@natir sorry to take a while here. I may need to share this somehow, but the graph is rather large and will take 4 days to build if you do so from scratch. I'll see if I can get a simpler test case together.

ekg avatar Jun 08 '22 10:06 ekg

@ekg Just so you know that your issue is not forgotten: I have a solution for parallelizing GFAffix and will work on this sometime soonish.

danydoerr avatar Mar 24 '23 13:03 danydoerr

@danydoerr any updates on the parallelization?

AndreaGuarracino avatar Dec 19 '24 22:12 AndreaGuarracino

@AndreaGuarracino thanks for asking. I'm preparing a release that should be out in the next few days. Do you want to have already a binary to test on? I'm also changing the cli...

danydoerr avatar Dec 20 '24 09:12 danydoerr

Of course, please 'exit' the binary xD


From: Daniel Doerr @.> Sent: Friday, December 20, 2024 03:31 To: marschall-lab/GFAffix @.> Cc: Andrea Guarracino @.>; Mention @.> Subject: Re: [marschall-lab/GFAffix] speeding up deep graphs (Issue #2)

@AndreaGuarracinohttps://github.com/AndreaGuarracino thanks for asking. I'm preparing a release that should be out in the next few days. Do you want to have already a binary to test on? I'm also changing the cli...

— Reply to this email directly, view it on GitHubhttps://github.com/marschall-lab/GFAffix/issues/2#issuecomment-2556613046, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHWFI4VP52KTV7UBVPD2GPPYTAVCNFSM6AAAAABT56CEJKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJWGYYTGMBUGY. You are receiving this because you were mentioned.Message ID: @.***>

AndreaGuarracino avatar Dec 20 '24 15:12 AndreaGuarracino

@AndreaGuarracino voilà gfaffix-0.2.0-prerelease.tar.gz

danydoerr avatar Dec 20 '24 16:12 danydoerr

Thx! Now I've noticed your refined_deduplication branch!

Looks extremely great. With 1 thread it is already a bit faster than the current main branch. And it is also slimmer in memory.

AndreaGuarracino avatar Dec 20 '24 20:12 AndreaGuarracino

Yes. That is the current development branch from which I generated the binary. There should be a small speed up when using up to 4 threads (-p4)

danydoerr avatar Dec 20 '24 20:12 danydoerr

@AndreaGuarracino do you have ungfaffixed graphs of the new HPRC assemblies that I can test on?

danydoerr avatar Dec 21 '24 09:12 danydoerr

Not yet, they'll come soon. I've started using the new gfaffix everywhere, so it is already "in production".


From: Daniel Doerr @.> Sent: Saturday, December 21, 2024 03:10 To: marschall-lab/GFAffix @.> Cc: Andrea Guarracino @.>; Mention @.> Subject: Re: [marschall-lab/GFAffix] speeding up deep graphs (Issue #2)

@AndreaGuarracinohttps://github.com/AndreaGuarracino do you have ungfaffixed graphs of the new HPRC assemblies that I can test on?

— Reply to this email directly, view it on GitHubhttps://github.com/marschall-lab/GFAffix/issues/2#issuecomment-2558058548, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AO26XHUH5LQIME267YIGIUT2GUWBRAVCNFSM6AAAAABT56CEJKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNJYGA2TQNJUHA. You are receiving this because you were mentioned.Message ID: @.***>

AndreaGuarracino avatar Dec 21 '24 16:12 AndreaGuarracino

New version is released now. It still requires improvement in speed and memory usage for these large graphs. The bottleneck is now only the i/o part of gfaffix, and I already have some concrete ideas how to improve it. I'll leave this issue open for now.

danydoerr avatar Jan 04 '25 09:01 danydoerr