
Use Zarr library directly as a parallel write sink.

theAeon opened this issue 1 year ago · 14 comments

This eliminates the full readback of the zarr chunks that was previously done to merge the outputs of the parallel `_write_single_zarr` calls, and it saves me a significant amount of time on large datasets.

Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.
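
For context, a minimal sketch of the idea (not the actual PR code; the store path, shapes, and random fill are placeholders): each worker opens the shared zarr store and writes its own rows directly, so there is no per-worker store left to merge afterwards.

```python
# Minimal sketch of a shared-sink parallel write, assuming zarr v2
# semantics; the store path, shapes, and data are placeholders.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

N_CELLS, N_FEATURES = 60, 400_000


def write_block(start, stop):
    # Each worker reopens the store and writes only its own rows.
    # With one cell per chunk, no two workers ever touch the same
    # chunk, so no locking or post-hoc merge step is needed.
    z = zarr.open("counts.zarr", mode="r+")
    z[start:stop, :] = np.random.random((stop - start, N_FEATURES))


if __name__ == "__main__":
    # Create the full-size array once, chunked one cell per chunk.
    zarr.open("counts.zarr", mode="w", shape=(N_CELLS, N_FEATURES),
              chunks=(1, N_FEATURES), dtype="f8")
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(write_block, s, s + 1) for s in range(N_CELLS)]
        for f in futures:
            f.result()  # surface any worker errors
```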

theAeon · Dec 21 '24

Looks like `MCDS.open` doesn't like the outputs of these (it fails on the obs_dim lookup with a KeyError), so... bear with me, I guess.

theAeon · Feb 06 '25

Siloing the zarr v3 stuff off into a PR for a later date; it was more broken than I thought.

theAeon · Feb 16 '25

Okay, I can confirm it actually works now.

theAeon · Feb 17 '25

Well, it works in the single-chunk, small-dataset case. I don't know whether increasing the chunk size broke it or whether large datasets don't work; not great either way. Testing single-chunk right now.

theAeon · Feb 17 '25

Judging by my script actually taking the time to process my data this go-around, I think it's reasonable to say that this PR works at chunk size 1 and no other.

That being said, I don't actually know that there's an advantage to multi-cell chunks once you cut the copy step out of the equation, since the per-cell writes are fully independent; see the sketch below.
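
To illustrate the chunk-boundary concern (made-up shapes, not the PR's code): zarr performs read-modify-write at chunk granularity, so a chunk shared by two workers is a race, while chunk-aligned slabs, or one-cell chunks, keep every writer independent.

```python
import zarr

# 4 cells per chunk, purely for illustration.
z = zarr.open("demo.zarr", mode="w", shape=(100, 1000),
              chunks=(4, 1000), dtype="f8")

# UNSAFE: rows 0-5 and 6-11 both overlap the chunk covering rows 4-7,
# so two concurrent workers would race on that chunk's read-modify-write.
unsafe_slabs = [(0, 6), (6, 12)]

# SAFE: slabs aligned to the 4-row chunk boundary, so each chunk is
# written by exactly one worker and no locking is required.
rows_per_chunk = z.chunks[0]
safe_slabs = [(s, min(s + rows_per_chunk, z.shape[0]))
              for s in range(0, z.shape[0], rows_per_chunk)]
```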

theAeon · Feb 17 '25

Hi @theAeon, thank you for the PR, and sorry I didn't follow up on this promptly. I'm not actively working on ALLCools right now, so I would only merge bug fixes while trying not to make breaking changes or add new functions.

I'm not exactly sure what made your use case so slow; we have previously generated MCDS for hundreds of thousands of cells with acceptable speed. Could you provide some details about your use case?

> Specifically, a 60-cell test dataset that was taking several hours in the zarr merge phase completed in 5 minutes, with no I/O-limited copy step.

lhqing · Feb 21 '25

If I'm remembering correctly, the difference in our use case is that our region definition is ~400k tiny sites, which makes the output of `_write_single_zarr` much larger and causes I/O hangs.

As for breaking changes, well, the goal here is to not break anything, which is why I brought up the chunk-size thing! Still working on validation; my most recent attempt didn't actually work, but I also forgot to apply an unrelated patch, so I'll keep you posted on that one.

theAeon · Feb 21 '25

I see. Yes, MCDS was not intended to handle hundreds of thousands of small features/mC sites; a much larger feature count may cause performance issues.

We had similar needs previously, so I wrote the BaseDS class, which aims to combine a bunch of ALLC files at single-base-pair level across the genome. Feel free to check whether it is relevant to you: https://github.com/lhqing/ALLCools/blob/c9f7be2ffd650c1a5430d2f6fc252c60f0c28e33/ALLCools/count_matrix/base_ds.py#L559C5-L559C21

lhqing · Feb 24 '25

I think (although I could be incorrect) that the data we're working with is a combination of single-BP sites and very small regions of several BP, so if I can get MCDS to work, I would like it to.

Current status of my run-through: it appears to be doing the calculations as expected but is not actually writing the data to disk, so I think I may have broken something else when I added the dictionary of regiongroups. Will keep you posted, as usual.

theAeon · Feb 24 '25

A new update on this: it appears that the no-output issue I'm having is not due to chunk size or the dictionary of regiongroups. It seems that above some number of cells, it's not outputting anything at all.

...edit: or not, it just failed on a small set too. Not sure what's happening.

theAeon · Feb 28 '25

...oh. Malformed allc_table. Bear with me.

theAeon · Feb 28 '25

Okay, so it turns out that on small datasets it does in fact work as intended for both single and multiple obs_dim chunks. Will push dataset.py as I have it after confirming that it works at scale.

theAeon · Feb 28 '25

Looks like it's working, multi-cell chunks and all. Apologies for the delay; I was using the wrong chromosome set on my end (among other things).

theAeon · Mar 07 '25

Semi-aside: I did just get this working with mpi4py's MPIPoolExecutor. Not sure how you'd want me to integrate that.
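
For reference, the swap itself is small, since mpi4py.futures mirrors the concurrent.futures interface. A sketch, reusing the placeholder `write_block` from the earlier example:

```python
# Sketch only: write_block and N_CELLS are the placeholders from the
# earlier shared-sink example, not ALLCools code.
from mpi4py.futures import MPIPoolExecutor

if __name__ == "__main__":
    with MPIPoolExecutor() as pool:
        futures = [pool.submit(write_block, s, s + 1) for s in range(N_CELLS)]
        for f in futures:
            f.result()  # surface any worker errors

# Launch under MPI, e.g.:
#   mpiexec -n 8 python -m mpi4py.futures script.py
```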

theAeon · Mar 07 '25