Multichrom example
~Stacked on #2663 (and #2619)~
@petrelharp @bhaller @molpopgen You might be interested in this: the new file c/examples/multichrom_wright_fisher.c gives a simple example of multi-chromosome simulation with parallel simplify (most of the other stuff is from other in-flight PRs).
This adds a simple example of running a multi-chromosome forwards simulation with the sort/simplify steps run in parallel using pthreads. This was really just to make sure that the basic idea of sharing the node table across the different table collections would work, and that we were getting the details right with the new TSK_SIMPLIFY_NO_FILTER_NODES and TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS options.
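Roughly, each worker thread sorts and simplifies its own chunk's table collection. Here's a minimal sketch of that per-chunk step, assuming one tsk_table_collection_t per chromosome with identical node tables (simplify_chunk is a hypothetical name; the real code is in the example file):

```c
#include <tskit/tables.h>

/* Sketch of the per-chunk work each pthread runs. All chunks keep an
 * identical node table: NO_FILTER_NODES means simplify never removes or
 * renumbers node rows, so node IDs stay consistent across chunks, and
 * NO_UPDATE_SAMPLE_FLAGS leaves the sample flags for the caller to
 * manage once, globally. */
static int
simplify_chunk(tsk_table_collection_t *tables, const tsk_id_t *samples,
    tsk_size_t num_samples)
{
    int ret = tsk_table_collection_sort(tables, NULL, 0);

    if (ret != 0) {
        return ret;
    }
    return tsk_table_collection_simplify(tables, samples, num_samples,
        TSK_SIMPLIFY_NO_FILTER_NODES | TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS,
        NULL);
}
```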
Running this through helgrind gives no errors (I think it's pretty good at picking up data races):
```
$ valgrind --tool=helgrind ./build/multichrom_wright_fisher 1000 10 1 tmp/sdf 1234
```
So, hooray, @molpopgen's ingenious idea works and with the infrastructure in #2663 and #2619 we can do it pretty easily!
Another thing I ended up doing here in 31c47e402fd02a19e64bb90454be48cfb42b2cb8 is to add a basic implementation of "delete rows", which is an operation we badly need in the API I think (mentioned here: https://github.com/tskit-dev/tskit/issues/1034#issuecomment-1304885063). I think it's worth talking about that separately, so I'll open an issue for it.
We can probably let this sit for a while now anyway, and should try to clear up some other stuff first.
Great! I'll take a closer look in a bit.
Super cool. Good to know about helgrind, too; I will look into that! :->
I guess this is relevant to https://github.com/tskit-dev/tskit/issues/176, and might inform how we deal with multiple chromosomes.
As I understand it, we could choose to tie the two things together – to make the separate tree sequences begin/end at chromosome boundaries – but we don't have to do that, and there are perhaps pretty strong reasons not to. For example: you want to be able to keep multiple tree sequences to speed up even single-chromosome models; chromosomes will be of very different lengths, whereas you want to divide the work of simplification as evenly as you can across your tree sequences; and the number of chromosomes you are simulating will often not match the number of threads you have available for simplification. So I would tentatively lean towards not connecting these two things.
I think they're not exclusive. If you want to represent multiple chromosomes, then this gives a nice way to do it. But there's no reason why you couldn't also do it on a single chromosome, or indeed on part of a chromosome within a multi-chromosome set.
But I do take your point that there's an elegance in keeping the things completely independent.
This is really about low-level C API stuff for now - how we might (or might not) surface this to higher-level things like representing multiple chromosomes is something we should wait and see about (which is, I think, what we're all saying!).
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.94%. Comparing base (44c7d7c) to head (f82a135). Report is 3 commits behind head on main.
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2665   +/-  ##
=======================================
  Coverage   89.94%   89.94%
=======================================
  Files          29       29
  Lines       32651    32651
  Branches     5854     5854
=======================================
  Hits        29368    29368
  Misses       1869     1869
  Partials     1414     1414
```
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.66% <ø> (ø) | |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 89.23% <ø> (ø) | |
| python-tests | 98.83% <ø> (ø) | |
Flags with carried forward coverage won't be shown.
@petrelharp @bhaller - I realised there was an easy way to join the tables: append all the edges to the first table, sort, and then squash adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified).
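A minimal sketch of that join, assuming one tsk_table_collection_t per chunk with identical node tables (join_tables is a hypothetical name; the example's real code may differ):

```c
#include <stddef.h>
#include <tskit/tables.h>

/* Append the edges from every other chunk into the first collection,
 * then sort and squash, so edges that were split at chunk boundaries
 * are merged back into single edges. */
static int
join_tables(tsk_table_collection_t *dest, tsk_table_collection_t **chunks,
    size_t num_chunks)
{
    int ret;
    tsk_id_t ret_id;
    tsk_size_t j;
    size_t k;
    const tsk_edge_table_t *edges;

    for (k = 1; k < num_chunks; k++) {
        edges = &chunks[k]->edges;
        for (j = 0; j < edges->num_rows; j++) {
            ret_id = tsk_edge_table_add_row(&dest->edges, edges->left[j],
                edges->right[j], edges->parent[j], edges->child[j], NULL, 0);
            if (ret_id < 0) {
                return (int) ret_id;
            }
        }
    }
    ret = tsk_table_collection_sort(dest, NULL, 0);
    if (ret != 0) {
        return ret;
    }
    /* NB: squash may leave the edges in a different order, so a final
     * sort may be needed before indexing the result. */
    return tsk_edge_table_squash(&dest->edges);
}
```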
Ah, probably need to set the sample flags for the final generation.
One more update - I've marked the sample flags and updated the original WF example so that it also indexes the output, and I think the outputs are now identical up to node reordering:
```
tree 0:
num_sites: 0
interval: (0.000000, 0.006008)
[wide ASCII tree drawing, truncated at the right margin in the original output]
tree 1:
num_sites: 0
interval: (0.006008, 0.008218)
:
```
and the second:
```
tree 0:
num_sites: 0
interval: (0.000000, 0.006008)
[wide ASCII tree drawing, truncated; same topology as above but with
 different node numbering]
```
We don't renumber the samples from 0 to n - 1 in the new version. If you do want that, you can run simplify on the output again and it'll renumber everything. (You could do this more cheaply too, if you wanted it for compatibility purposes.)
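Concretely, that second pass is just a simplify with default options (a sketch, reusing the tables and samples variables from the example):

```c
/* A plain simplify (without the NO_FILTER_NODES/NO_UPDATE_SAMPLE_FLAGS
 * options) filters unreferenced nodes and remaps the samples to the ids
 * 0..n-1. */
ret = tsk_table_collection_simplify(tables, samples, num_samples, 0, NULL);
```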
I think we can meaningfully compare the timings of this with haploid_wright_fisher now, @petrelharp and @bhaller.
I think I know what's going on with the disappointing scaling here. Here's an example from running the code:
```
$ N=10000; T=5000; S=1000
$ ./haploid_wright_fisher $N $T $S n10k.trees 42
Simplify at generation 4000: (10010000 nodes 20000000 edges) -> (84542 nodes 439275 edges)
Simplify at generation 3000: (10084542 nodes 20439275 edges) -> (92436 nodes 494357 edges)
Simplify at generation 2000: (10092436 nodes 20494357 edges) -> (97097 nodes 527279 edges)
Simplify at generation 1000: (10097097 nodes 20527279 edges) -> (100073 nodes 547226 edges)
Simplify at generation 0: (10100073 nodes 20547226 edges) -> (102133 nodes 560584 edges)
$ ./multichrom_wright_fisher $N $T $S n10k-multichrom.trees 42
Simplify 10010000 nodes
chunk 1: 11250775 -> 70861
chunk 5: 11249822 -> 68845
chunk 0: 11250882 -> 69585
chunk 6: 11248979 -> 71402
chunk 3: 11250703 -> 71395
chunk 7: 11247502 -> 71564
chunk 4: 11251640 -> 71959
chunk 2: 11249697 -> 71370
```
The numbers for the multichromosome example are the numbers of edges before and after simplification. Notice that each worker has about half the total number of edges: there were 20M edges for the whole simulation (2 edges * 10K individuals * 1K generations), and each of the 8 workers has about 11M edges to deal with.
This is because: suppose that a chromosome has been inherited in k pieces, i.e., there are k edges that partition that chromosome. If we divide the chromosome among n workers, then we will get n + k - 1 edges, since the chromosome is now cut at (a) the n - 1 boundaries between adjacent chunks and (b) the k - 1 inheritance breakpoints, giving (n - 1) + (k - 1) + 1 = n + k - 1 pieces – unless we get lucky and some of the breakpoints fall exactly on a division between adjacent chunks. (In this simulation, every chromosome is inherited in exactly k = 2 pieces.)
Now, what does simplify have to do under the hood? It's got a data structure that records the collection of ancestral segments that it's tracking, and then it goes through the edges one at a time and deals with them. This means that if the cost of dealing with each edge is O(1), then the parallel runtime is a fraction (n + k - 1)/(nk) ~ 1/k of the serial runtime, so the maximal speedup is about k. (This is because we've increased the number of edges by a factor of (n + k - 1)/k but then divided them among n workers.)
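Spelling out that arithmetic (just restating the paragraph above, with $c$ the cost per edge):

$$
T_{\mathrm{serial}} = ck, \qquad
T_{\mathrm{parallel}} = c \cdot \frac{n + k - 1}{n}, \qquad
\mathrm{speedup} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} = \frac{nk}{n + k - 1} \xrightarrow{n \to \infty} k.
$$

For the run above, $n = 8$ and $k = 2$ give a best-case speedup of $16/9 \approx 1.8$, consistent with each worker holding about 11M of the 20M original edges.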
However, the work required by simplify's data structure does scale with the number of segments it is tracking, to some degree: although simplify has to look at all those edges, it only has to actually do something when it encounters an edge along which some of its segments move. If this were a substantial contribution to runtime, then we might expect better speedups. The things we'd need to know are (a) what proportion of simplify's work is "do something to the segments" versus "look at the next edge"; and (b) once we divide up a chromosome into chunks, how much does that reduce the proportion of its edges through which a segment of ancestry passes? I'm not very clear on this point, but I suspect the answer to (b) is "not much"; if so, then there's no hope for a speedup.
However, this approach would work just fine if each worker got literally one chromosome, since then edge breakpoints would usually (well, half the time) fall exactly on the divisions between workers. I may do a modification to the code here to have a look at that case, and see if we have wins there.
ping @jeromekelleher @molpopgen
I did some very simple benchmarking - comparing runtimes for k chromosomes, done (a) in parallel, and (b) not. (To do this I added a new file, multichrom_wright_fisher_singlethreaded.c, which we probably don't want to keep if we merge this example.)
Here's the script:
```bash
#!/bin/bash
N=1000
T=5000
S=200
for k in $(seq 12)
do
    /usr/bin/time ./multichrom_wright_fisher_singlethreaded $N $T $S n10k.trees 42 $k &> single${k}.log
    /usr/bin/time ./multichrom_wright_fisher $N $T $S n10k.trees 42 $k &> multi${k}.log
done
echo "type num_chroms user system elapsed percent_cpu N T S" > results.txt
for t in single multi
do
    (for x in ${t}*.log; do echo ${x%.log} $(tail -n 2 $x | head -n 1 | cut -f 1-4 -d ' ') $N $T $S; done | sed -e 's/user//' | sed -e 's/chroms//' | sed -e 's/system//' | sed -e 's/elapsed//' | sed -e 's/.CPU//' | sort -n ) | tr ' ' '\t' >> results.txt
done
```
With that, I get the plot below: black points are single-threaded, red are multithreaded. Note that the simulations are still doing a fair bit of non-parallel work that is included in these timings:

[figure: elapsed time against number of chromosomes; black = single-threaded, red = multithreaded]
This is looking very good! The max CPU utilization percent reported is 675%. Here's the rest of the results:
```
type    num_chroms   user  system  elapsed  percent_cpu
multi            1   4.22    0.48     4.70           99
multi            2   8.50    1.02     5.36          177
multi            3  13.21    1.47     5.92          247
multi            4  18.17    2.01     6.52          309
multi            5  24.05    2.86     8.27          325
multi            6  30.71    3.70     9.02          381
multi            7  38.30    4.45     9.79          436
multi            8  46.23    5.26    10.56          487
multi            9  54.43    6.11    11.41          530
multi           10  63.40    7.06    12.16          579
multi           11  73.02    8.35    12.95          628
multi           12  83.72    9.45    13.79          675
single           1   3.97    0.05     4.02          100
single           2   7.91    0.16     8.08          100
single           3  12.31    0.61    12.93           99
single           4  16.69    0.82    17.52           99
single           5  21.33    1.14    22.47           99
single           6  25.93    1.39    27.33           99
single           7  30.78    1.79    32.57           99
single           8  35.85    2.18    38.04           99
single           9  40.84    1.96    42.81           99
single          10  45.17    2.69    47.87           99
single          11  50.77    3.11    53.89           99
single          12  55.90    3.41    59.32           99
```
Aha! That's great @petrelharp, it really clarifies things :+1:
I'm thinking it's worth fixing up the multichrom example to merge in? @bhaller is enthusiastic about getting this into SLiM at some point. I think this will just require switching delete_rows over to keep_rows, and removing the extra file I put in?
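For reference, a minimal sketch of the keep_rows pattern (keep_edges_in_window and the windowing criterion are hypothetical; they just stand in for whatever row filter the example needs):

```c
#include <stdlib.h>
#include <tskit/tables.h>

/* Keep only the edges intersecting [left, right), dropping the rest.
 * tsk_edge_table_keep_rows takes a boolean mask with one entry per row. */
static int
keep_edges_in_window(tsk_edge_table_t *edges, double left, double right)
{
    int ret;
    tsk_size_t j;
    tsk_bool_t *keep = malloc(edges->num_rows * sizeof(*keep));

    if (keep == NULL) {
        return TSK_ERR_NO_MEMORY;
    }
    for (j = 0; j < edges->num_rows; j++) {
        keep[j] = edges->left[j] < right && edges->right[j] > left;
    }
    ret = tsk_edge_table_keep_rows(edges, keep, 0, NULL);
    free(keep);
    return ret;
}
```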
OK, I'll fix it up when I get a chance.
I've updated this so that it's a clean build against upstream now @petrelharp. I think it needs a bit of documentation though - I don't really follow what the algorithm is doing. Could you update it with some comments to help clarify what the model is, etc.?
I think the docs build failure is because of a stale cache. The simplest thing is probably to create a new PR. I've squashed my original commits down to one.
I'm away on vacation for a couple of weeks, but:
- The speedup should be 75-80% at most if one doesn't do one "chromosome" per thread. That's what I showed with my prototype when I presented this method.
- @petrelharp -- if I understand your figure, 12 chromosomes means 12 threads and a 6x speedup? Is that correct? If you run perf, you should be able to quantify the % of time that is purely single-threaded, which will give a better sense of how efficient the threading is.
Hi @benjeffery and @jeromekelleher. Do you really want to close this? It remains a useful example; I'm going to be consulting it carefully in the multichromosome work I'm doing in SLiM right now, in fact, and I can imagine other folks finding it interesting and useful too. I'm not saying you need to merge it necessarily, now or perhaps ever, but if it's closed, it seems to me that a fair bit of very useful knowledge and discussion will become much harder to find. Reopening so this comment doesn't get lost, but if you really want to close it, I won't get into a close/reopen war with you. :->
It's fairly esoteric knowledge, @bhaller, which I'm not sure is worth keeping a PR open in perpetuity for. We've learned the key lessons from it, I think, and moved on?
But anybody who wants to actually use tskit in a multithreaded manner (which would be me, and very probably others, sooner or later) will continue to find it useful. Which is the point of an example, isn't it? Why purge it from the knowledge pool?
If it's useful we should merge it I guess - I didn't think too hard about it tbh. Keeping it here in limbo isn't an answer.
Does anyone want to have a look over and make suggestions for what would be needed to make this a useful documentation example?
I nominate @petrelharp :->. In all seriousness, he knows the tskit ecosystem way better than I do, knows your doc system, and has been involved in this all along. Can I volunteer you, Peter? :->
I think the proposal is to add something to the docs, here. I could do that, sure. I'm imagining just talking through, like "we can have the tables for each chromosome share the same node and individual tables; here's how we do that".
I can finish it up too @petrelharp, if you'd prefer to have a look through and post some review comments.
Go for it!
This code doesn't reset the sample flags; I've added a bit that does that, and also implemented one of @bhaller's suggestions. There's one more suggestion for which I don't know where the balance between 'code complexity' and 'safe coding' lies; let's get that sorted (or not) and merge this?
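For context, the fix is along these lines (a sketch of the idea, not the committed code; reset_sample_flags is a hypothetical name). With TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS in use, the caller owns the TSK_NODE_IS_SAMPLE flags, so after each generation we clear them everywhere and set them on the current samples:

```c
#include <tskit/tables.h>

/* Clear TSK_NODE_IS_SAMPLE on every node, then set it on the nodes in
 * `samples` (here, the final generation). Needed because simplify was
 * called with TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS. */
static void
reset_sample_flags(tsk_node_table_t *nodes, const tsk_id_t *samples,
    tsk_size_t num_samples)
{
    tsk_size_t j;

    for (j = 0; j < nodes->num_rows; j++) {
        nodes->flags[j] &= ~TSK_NODE_IS_SAMPLE;
    }
    for (j = 0; j < num_samples; j++) {
        nodes->flags[samples[j]] |= TSK_NODE_IS_SAMPLE;
    }
}
```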
Okay - fixed those comments up. I'm not sure why the builds are failing?