abPOA
Is there any way to reduce memory consumption?
Hello, I'm experimenting with adding abPOA as an option within cactus (manuscript). Thanks for making a great tool -- it's amazingly fast.
I was wondering, however, if there's a way to reduce memory consumption, in order to increase the sequence lengths I can run on. Right now it seems roughly quadratic in the sequence length, which is as expected from reading your manuscript. I'm curious whether there are any options I can use to reduce this, and/or whether you've thought about using the banding to reduce the DP table size (as far as I can tell, it's only used to reduce computation)?
Hi Glenn, you are right. Right now, the banding in abPOA only reduces the time, not the memory, so it is still quadratic. I do plan to reduce the memory consumption in different ways, but I haven't implemented it yet. Will let you know if I have any progress.
Yan
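To make that concrete, here is a toy cell-count model (my illustration, not abPOA's actual accounting) of why a band cuts computation but, without band-aware storage, not allocation:

```c
#include <stdint.h>

/* Toy model, not abPOA's actual accounting: the band limits which DP
 * cells are computed, but if the full n x m score matrix is still
 * allocated, the memory footprint stays quadratic. */
static uint64_t full_cells(uint64_t n, uint64_t m) {
    return n * m;               /* cells allocated without band-aware storage */
}

static uint64_t banded_cells(uint64_t n, uint64_t band) {
    return n * (2 * band + 1);  /* cells actually computed inside the band */
}
```

For two 100 kb sequences with a band half-width of 100, the band touches ~20 million cells while a dense matrix still holds 10 billion, which is why banding alone speeds things up without shrinking the allocation.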
I would also like to express my interest in resolving this issue :+1: It would be really nice to be able to take full advantage of the banding.
Hi Glenn,
In the latest version of abPOA (v1.2.0), I implemented minimizer-based seeding before POA; this can reduce the memory usage for long input sequences. Most of the time, it produces nearly the same or even better alignment results.
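For readers unfamiliar with the idea, minimizer seeding keeps only the smallest k-mer in each window of w consecutive k-mers, and the POA then only needs to fill the gaps between matching anchors. A toy sketch of minimizer selection (lexicographic comparison stands in for the hashing a real implementation such as abPOA's would use):

```c
#include <string.h>

/* Toy minimizer selection: for every window of w consecutive k-mers,
 * record the position of the smallest k-mer (lexicographic order here;
 * real implementations hash k-mers first). Consecutive windows sharing
 * the same minimum are recorded once. Returns the number of positions
 * written to pos_out. */
static int pick_minimizers(const char *s, int k, int w,
                           int *pos_out, int max_out) {
    int n = (int)strlen(s);
    int n_kmers = n - k + 1, count = 0;
    for (int i = 0; i + w <= n_kmers; i++) {
        int best = i;
        for (int j = i + 1; j < i + w; j++)
            if (strncmp(s + j, s + best, (size_t)k) < 0) best = j;
        if ((count == 0 || pos_out[count - 1] != best) && count < max_out)
            pos_out[count++] = best;
    }
    return count;
}
```

Because only a sparse set of positions survives, anchoring on matched minimizers bounds each DP sub-problem to the distance between adjacent anchors instead of the full sequence length.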
Please try it out and let me know if this works for you.
Yan
Great! This is perfect timing since I was just about to review some of what I'd been doing for stitching alignments together. I'll try it out on Monday. Thanks!
@glennhickey, I just updated abPOA to v1.2.1, which removes a redundant sorting step that was very time-consuming.
Thanks for letting me know. I'm switching to 1.2.1 now. My 1.2.0 tests have been okay so far: it passes all Cactus unit tests, and let me disable our current stitching logic on a bigger run. I'll do a much bigger test this week and report the results.
Do you have a sense of the maximum sequence lengths I can pass in while using the seeding? I just got an error
[SIMDMalloc] mm_Malloc fail!
Size: 549755813888
when I allowed up to 1 Mb. Thanks.
@glennhickey what alignment parameters are you using?
@ekg I'm still using the defaults for everything (edit -- except wb/wf, which I increased from 10/0.01 to 25/0.025). I haven't yet explored the parameter space much despite meaning to for a while, especially in the context of alignments between more distant species.
Until now, I've been capping abpoa jobs at 10 kb (and using an overlapping sliding window and stitching the results together). Bumping this up to 1 Mb with the latest abpoa seemed to work on smaller tests but not on a bigger job.
I'm getting failures even on datasets that ran before (without the seeding)
...
== 05-19-2021 22:50:05 == [abpoa_anchor_poa] Performing POA between anchors ...
== 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors done.
== 05-19-2021 22:50:07 == [abpoa_build_guide_tree_partition] Seeding and chaining ...
== 05-19-2021 22:50:07 == [abpoa_build_guide_tree_partition] Seeding and chaining done!
== 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors ...
== 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors done.
Command terminated by signal 11
@glennhickey Can you share the dataset that causes the error/failure?
Sure. I'll need to hack cactus a bit to spit it out, but should be able to do that soon.
I was just about to send you another segfault I got without seeding:
wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_may26.fa
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1
But I realized that when I built with a newer -march it worked. More specifically, upgrading from -march=nehalem to -march=haswell fixed it. (Cactus had previously built against nehalem to maximize portability for releases.) I think it's pretty likely the problem I mentioned above is related to this.
@glennhickey I did not get any error on my computer for this data. However, I did notice a big difference when using the scoring parameters you mentioned: not only do they produce different MSA output, they also use more memory and take longer to run.
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1
[abpoa_main] Real time: 111.849 sec; CPU: 111.208 sec; Peak RSS: 17.228 GB.
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025
[abpoa_main] Real time: 28.047 sec; CPU: 27.946 sec; Peak RSS: 3.135 GB.
For seeding mode:
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1
[abpoa_main] Real time: 94.547 sec; CPU: 70.398 sec; Peak RSS: 13.474 GB.
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -b 100 -f 0.025
[abpoa_main] Real time: 35.114 sec; CPU: 26.085 sec; Peak RSS: 4.830 GB.
Yes, that data works for me now too. I just thought it was interesting, as that command line was the first I found that did not work on architectures older than Haswell; to reproduce the crash you'd have to build with -march=nehalem instead of -march=native (or use a computer that's more than 7 years old).
While the scoring parameters make a big difference in runtime, they also seem to help accuracy considerably when aligning different species together. The best we've found for this has been to use the default HOXD70 matrix from lastz
|   | A | C | G | T |
|---|---|---|---|---|
| A | 91 | -114 | -31 | -123 |
| C | -114 | 100 | -125 | -31 |
| G | -31 | -125 | 100 | -114 |
| T | -123 | -31 | -114 | 91 |
On a simulation test, this matrix (which I override in abpt->mat) brings accuracy up by around 7% vs. the abpoa defaults. On less divergent sequences there is also an improvement, but it is much smaller.
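As a sketch, the override described above amounts to filling the aligner's m x m integer score matrix with the HOXD70 values. The 5x5 layout below, with base order A, C, G, T, N and a zero-scored N row/column, is my assumption, not abPOA's documented behavior; check abPOA's headers before copying it into abpt->mat.

```c
/* HOXD70 substitution scores laid out for a 5-letter alphabet.
 * ASSUMPTION: base order A, C, G, T, N and a zero-scored N row/column;
 * verify against abPOA's headers before writing this into abpt->mat. */
static const int hoxd70[5][5] = {
    /*         A     C     G     T   N */
    /* A */ {  91, -114,  -31, -123, 0 },
    /* C */ {-114,  100, -125,  -31, 0 },
    /* G */ { -31, -125,  100, -114, 0 },
    /* T */ {-123,  -31, -114,   91, 0 },
    /* N */ {   0,    0,    0,    0, 0 },
};
```

The matrix is symmetric, with transitions (A-G, C-T) penalized less than transversions, which is what helps on cross-species alignments.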
Thanks for the information! I am working on changing the scoring parameter options.
Yan
Hi @yangao07 , I've been experimenting with the seeding option to run on longer sequence sizes. It works pretty well, but I get segfaults from time to time. Here are some examples
wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_mar17.tar.gz
tar xzf abpoa_fail_mar17.tar.gz
for f in abpoa_fail_mar17/*.fa; do abpoa $f -m 0 -r 1 -S ; done
If I understand correctly, the memory with seeding is much lower... but only if abpoa can find enough seeds. If the sequences are too diverged, the memory can still explode.
If this is correct, do you think there would be a way to change the API to fail more gracefully in these cases? For example, if there are not enough seeds, and the memory will exceed a given threshold, return an error code. Or a function that checks the seeds in the input and estimates the memory requirement? Either of these would allow the user to use seeding when possible and fall back on another approach if it won't work.
Thanks as always for your wonderful tool!
> but only if abpoa can find enough seeds. If the sequences are too diverged, the memory can still explode.
You are right; for divergent sequences, especially ones with greatly different lengths, the memory can still be very large.
The memory size depends simply on the graph size and the sequence length, so it can be estimated. I can try to add a pre-calculation step for this.
Yan
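A back-of-the-envelope version of that estimate, assuming one score cell per (graph node, sequence position) pair; the dense-matrix assumption and fixed cell size are illustrative guesses, not abPOA's exact layout:

```c
#include <stdint.h>

/* Rough DP memory model: graph nodes x sequence length x bytes per cell.
 * ASSUMPTION: a dense matrix and a fixed cell size; abPOA's real
 * accounting differs, so treat this as an order-of-magnitude check. */
static uint64_t est_dp_bytes(uint64_t graph_nodes, uint64_t seq_len,
                             uint64_t bytes_per_cell) {
    return graph_nodes * seq_len * bytes_per_cell;
}

/* Caller-side guard: skip (or re-seed) an alignment that would exceed
 * a user-supplied memory budget. */
static int within_budget(uint64_t graph_nodes, uint64_t seq_len,
                         uint64_t bytes_per_cell, uint64_t budget) {
    return est_dp_bytes(graph_nodes, seq_len, bytes_per_cell) <= budget;
}
```

With, say, 2-byte cells, a 100k-node graph against a 100 kb sequence already models to ~20 GB, roughly the order of magnitude of the peak-RSS figures reported earlier in this thread.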
Thanks, that would be amazing. Or even some kind of interface where the user passes in a MAX_SIZE parameter and abpoa exits 1 instead of trying to allocate >MAX_SIZE would be very helpful.
> exits 1 instead of trying to allocate >MAX_SIZE would be very helpful.
oops, exit isn't much better than running out of memory -- would have to be a return code or exception.
Hey @glennhickey , I am working on adding some interfaces related to memory usage by abPOA.
Here is what I have done for now:
Added two variables to `abpoa_t`: `status` and `req_mem`. For `status`, 0 means success, 1 means not enough memory, and 2 means other errors. `req_mem` indicates the size of the memory abPOA tried, but failed, to allocate. This way, by checking the `status` variable, users can choose to re-run abpoa with adjusted parameters.
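From the caller's side, an interface like that might be used as follows; the struct and constants here are a mock that mirrors the description, not the released abPOA headers:

```c
#include <stddef.h>

/* Mock of the status/req_mem fields described above; the names and
 * values follow this comment thread, NOT the released abPOA API. */
enum { AB_OK = 0, AB_NO_MEM = 1, AB_OTHER_ERR = 2 };

typedef struct {
    int status;     /* 0 = success, 1 = not enough memory, 2 = other error */
    size_t req_mem; /* size of the allocation that failed, when status == 1 */
} abpoa_status_mock_t;

/* Caller policy: on AB_NO_MEM, fall back to another strategy
 * (e.g. a smaller alignment window) instead of crashing. */
static int should_fall_back(const abpoa_status_mock_t *ab) {
    return ab->status == AB_NO_MEM;
}
```

The point of returning normally with `status` set, rather than aborting, is exactly this kind of caller-side policy: the application decides whether to retry, shrink the problem, or give up.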
What do you think?
Wow, I'm happy to hear that you're thinking about this!
If I call abpoa and it fails a malloc, and instead of crashing it sets status to 1 and returns normally, that would be a big help indeed, and I'd definitely like to try it out.
I would still be a bit worried, though, because I run abpoa in many threads on cloud instances. I can imagine a case where a big malloc succeeds and abpoa takes 100% of the resources on a system; then all the concurrent threads would crash, and that would effectively bring down the job I was running anyway, even if it's not directly abpoa's fault.
Do you think there could be any way for me to give abpoa a limit, and ask it to set status 1 if it ever tries to allocate more than that amount of memory at one time?
Thanks again for all your help.
I have thought about using a size limit. The concern is that the size abpoa allocates is the virtual size, not the resident size, and the virtual size could be much larger than the physical memory of the computer. I am not sure how to properly set the size limit; do you have any idea?
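One way to see the virtual-vs-resident distinction in practice: getrusage reports the peak resident set, which is what actually competes for physical RAM, while pages of a large allocation that are never touched mostly stay virtual. A minimal POSIX probe (note that ru_maxrss is in kilobytes on Linux but bytes on macOS):

```c
#include <sys/resource.h>

/* Return the process's peak resident set size as reported by the OS,
 * or -1 on failure. Units are platform-dependent: kilobytes on Linux,
 * bytes on macOS. */
static long peak_rss(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_maxrss;
}
```

A library could compare a figure like this (or its own allocation tally) against a user-supplied budget, though as noted above, mapping virtual allocations to eventual resident usage is the hard part.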
ok, I think I understand better now, thanks. I was (probably naively) hoping it would be simple for you to detect how far outside the band the DP had gotten and abort before overrunning the memory. I'll check with some colleagues who know more about virtual address spaces than I do.