Multichrom example
~Stacked on #2663 (and #2619)~
@petrelharp @bhaller @molpopgen You might be interested in this: the new file c/examples/multichrom_wright_fisher.c gives a simple example of multi-chromosome simulation with parallel simplify (most of the other stuff is from other in-flight PRs).
This adds a simple example of running a multi-chromosome forwards simulation with the sort/simplify steps run in parallel using pthreads. This was really just to make sure that the basic idea of sharing the node table across the different table collections would work, and that we were getting the details right with the new TSK_SIMPLIFY_NO_FILTER_NODES and TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS options.
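Roughly, each worker thread sorts and simplifies its own chunk's table collection. Here's a minimal sketch of that per-chunk step, assuming one tsk_table_collection_t per chromosome with identical node tables (simplify_chunk is a hypothetical name; the real code is in the example file):

```c
#include <tskit/tables.h>

/* Sketch of the per-chunk work each pthread runs. All chunks keep an
 * identical node table: NO_FILTER_NODES means simplify never removes or
 * renumbers node rows, so node IDs stay consistent across chunks, and
 * NO_UPDATE_SAMPLE_FLAGS leaves the sample flags for the caller to
 * manage once, globally. */
static int
simplify_chunk(tsk_table_collection_t *tables, const tsk_id_t *samples,
    tsk_size_t num_samples)
{
    int ret = tsk_table_collection_sort(tables, NULL, 0);

    if (ret != 0) {
        return ret;
    }
    return tsk_table_collection_simplify(tables, samples, num_samples,
        TSK_SIMPLIFY_NO_FILTER_NODES | TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS,
        NULL);
}
```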
Running this through helgrind gives no errors (I think it's pretty good at picking up data races):
```
$ valgrind --tool=helgrind ./build/multichrom_wright_fisher 1000 10 1 tmp/sdf 1234
```
So, hooray, @molpopgen's ingenious idea works and with the infrastructure in #2663 and #2619 we can do it pretty easily!
Another thing I ended up doing here in 31c47e402fd02a19e64bb90454be48cfb42b2cb8 is to add a basic implementation of "delete rows", which is an operation we badly need in the API I think (mentioned here: https://github.com/tskit-dev/tskit/issues/1034#issuecomment-1304885063). I think it's worth talking about that separately, so I'll open an issue for it.
We can probably let this sit for a while now anyway, and should try to clear up some other stuff first.
Great! I'll take a closer look in a bit.
Super cool. Good to know about helgrind, too; I will look into that! :->
I guess this is relevant to https://github.com/tskit-dev/tskit/issues/176, and might inform how we deal with multiple chromosomes.
As I understand it, we could choose to tie the two things together – to make the separate tree sequences begin/end at chromosome boundaries – but we don't have to do that, and there are perhaps pretty strong reasons not to. For example: you want to be able to keep multiple tree sequences to speed up even single-chromosome models; chromosomes will be of very different lengths, whereas you want to divide the work of simplification as evenly as you can across your tree sequences; and the number of chromosomes you are simulating will often not match the number of threads you have available for simplification. So I would tentatively lean towards not connecting these two things.
I think they're not exclusive. If you want to represent multiple chromosomes, then this gives a nice way to do it. But there's no reason why you couldn't also do it on a single chromosome, or indeed on part of a chromosome within a multi-chromosome set.
But I do take your point that there's an elegance in keeping the things completely independent.
This is really about low-level C API stuff for now - how we might (or might not) surface this to higher-level things like representing multiple chromosomes is something we should wait and see about (which is, I think, what we're all saying!).
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.94%. Comparing base (44c7d7c) to head (f82a135). Report is 3 commits behind head on main.
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2665   +/-  ##
=======================================
  Coverage   89.94%   89.94%
=======================================
  Files          29       29
  Lines       32651    32651
  Branches     5854     5854
=======================================
  Hits        29368    29368
  Misses       1869     1869
  Partials     1414     1414
```
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.66% <ø> (ø) | |
| lwt-tests | 80.78% <ø> (ø) | |
| python-c-tests | 89.23% <ø> (ø) | |
| python-tests | 98.83% <ø> (ø) | |
Flags with carried forward coverage won't be shown.
@petrelharp @bhaller - I realised there was an easy way to join the tables: append all the edges to the first table, sort, and then squash adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified).
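A minimal sketch of that join, assuming one tsk_table_collection_t per chunk with identical node tables (join_tables is a hypothetical name; the example's real code may differ):

```c
#include <stddef.h>
#include <tskit/tables.h>

/* Append the edges from every other chunk into the first collection,
 * then sort and squash, so edges that were split at chunk boundaries
 * are merged back into single edges. */
static int
join_tables(tsk_table_collection_t *dest, tsk_table_collection_t **chunks,
    size_t num_chunks)
{
    int ret;
    tsk_id_t ret_id;
    tsk_size_t j;
    size_t k;
    const tsk_edge_table_t *edges;

    for (k = 1; k < num_chunks; k++) {
        edges = &chunks[k]->edges;
        for (j = 0; j < edges->num_rows; j++) {
            ret_id = tsk_edge_table_add_row(&dest->edges, edges->left[j],
                edges->right[j], edges->parent[j], edges->child[j], NULL, 0);
            if (ret_id < 0) {
                return (int) ret_id;
            }
        }
    }
    ret = tsk_table_collection_sort(dest, NULL, 0);
    if (ret != 0) {
        return ret;
    }
    /* NB: squash may leave the edges in a different order, so a final
     * sort may be needed before indexing the result. */
    return tsk_edge_table_squash(&dest->edges);
}
```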
Ah, probably need to set the sample flags for the final generation.
One more update - I've marked the sample flags and updated the original WF example so that it also indexes the output, and I think the outputs are now identical up to node reordering:
```
tree 0:
num_sites: 0
interval: (0.000000, 0.006008)
[wide ASCII tree drawing, truncated at the right margin in the original output]
tree 1:
num_sites: 0
interval: (0.006008, 0.008218)
:
```
and the second:
```
tree 0:
num_sites: 0
interval: (0.000000, 0.006008)
[wide ASCII tree drawing, truncated; same topology as above but with
 different node numbering]
```
We don't renumber the samples from 0 to n - 1 in the new version. If you do want that, you can run simplify on the output again and it'll renumber everything. (You could do this more cheaply too, if you wanted it for compatibility purposes.)
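Concretely, that second pass is just a simplify with default options (a sketch, reusing the tables and samples variables from the example):

```c
/* A plain simplify (without the NO_FILTER_NODES/NO_UPDATE_SAMPLE_FLAGS
 * options) filters unreferenced nodes and remaps the samples to the ids
 * 0..n-1. */
ret = tsk_table_collection_simplify(tables, samples, num_samples, 0, NULL);
```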
I think we can meaningfully compare the timings of this with haploid_wright_fisher now, @petrelharp and @bhaller.
I think I know what's going on with the disappointing scaling here. Here's an example from running the code:
```
$ N=10000; T=5000; S=1000
$ ./haploid_wright_fisher $N $T $S n10k.trees 42
Simplify at generation 4000: (10010000 nodes 20000000 edges) -> (84542 nodes 439275 edges)
Simplify at generation 3000: (10084542 nodes 20439275 edges) -> (92436 nodes 494357 edges)
Simplify at generation 2000: (10092436 nodes 20494357 edges) -> (97097 nodes 527279 edges)
Simplify at generation 1000: (10097097 nodes 20527279 edges) -> (100073 nodes 547226 edges)
Simplify at generation 0: (10100073 nodes 20547226 edges) -> (102133 nodes 560584 edges)
$ ./multichrom_wright_fisher $N $T $S n10k-multichrom.trees 42
Simplify 10010000 nodes
chunk 1: 11250775 -> 70861
chunk 5: 11249822 -> 68845
chunk 0: 11250882 -> 69585
chunk 6: 11248979 -> 71402
chunk 3: 11250703 -> 71395
chunk 7: 11247502 -> 71564
chunk 4: 11251640 -> 71959
chunk 2: 11249697 -> 71370
```
The numbers for the multichromosome example are the numbers of edges before and after simplification. Notice that each worker has about half the total number of edges: there were 20M edges for the whole simulation (2 edges * 10K individuals * 1K generations), and each of the 8 workers has about 11M edges to deal with.
This is because: suppose that a chromosome has been inherited in k pieces, i.e., there are k edges that partition that chromosome. If we divide the chromosome among n workers, then we will get n + k - 1 edges, since the chromosome is now cut at (a) the n - 1 boundaries between adjacent chunks and (b) the k - 1 inheritance breakpoints, giving (n - 1) + (k - 1) + 1 = n + k - 1 pieces – unless we get lucky and some of the breakpoints fall exactly on a division between adjacent chunks. (In this simulation, every chromosome is inherited in exactly k = 2 pieces.)
Now, what does simplify have to do under the hood? It's got a data structure that records the collection of ancestral segments that it's tracking, and then it goes through the edges one at a time and deals with them. This means that if the cost of dealing with each edge is O(1), then the parallel runtime is a fraction (n + k - 1)/(nk) ~ 1/k of the serial runtime, so the maximal speedup is about k. (This is because we've increased the number of edges by a factor of (n + k - 1)/k but then divided them among n workers.)
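Spelling out that arithmetic (just restating the paragraph above, with $c$ the cost per edge):

$$
T_{\mathrm{serial}} = ck, \qquad
T_{\mathrm{parallel}} = c \cdot \frac{n + k - 1}{n}, \qquad
\mathrm{speedup} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} = \frac{nk}{n + k - 1} \xrightarrow{n \to \infty} k.
$$

For the run above, $n = 8$ and $k = 2$ give a best-case speedup of $16/9 \approx 1.8$, consistent with each worker holding about 11M of the 20M original edges.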
However, the work required by simplify's data structure does scale with the number of segments it is tracking, to some degree: although simplify has to look at all those edges, it only has to actually do something when it encounters an edge along which some of its segments move. If this were a substantial contribution to runtime, then we might expect better speedups. The things we'd need to know are (a) what proportion of simplify's work is "do something to the segments" versus "look at the next edge"; and (b) once we divide up a chromosome into chunks, how much does that reduce the proportion of its edges through which a segment of ancestry passes? I'm not very clear on this point, but I suspect the answer to (b) is "not much"; if so, then there's no hope for a speedup.
However, this approach would work just fine if each worker got literally one chromosome, since then edge breakpoints would usually (well, half the time) fall exactly on the divisions between workers. I may do a modification to the code here to have a look at that case, and see if we have wins there.
ping @jeromekelleher @molpopgen
I did some very simple benchmarking - comparing runtimes for k chromosomes, done (a) in parallel, and (b) not. (To do this I added a new file, multichrom_wright_fisher_singlethreaded.c, which we probably don't want to keep if we merge this example.)
Here's the script:
```bash
#!/bin/bash
N=1000
T=5000
S=200
for k in $(seq 12)
do
    /usr/bin/time ./multichrom_wright_fisher_singlethreaded $N $T $S n10k.trees 42 $k &> single${k}.log
    /usr/bin/time ./multichrom_wright_fisher $N $T $S n10k.trees 42 $k &> multi${k}.log
done
echo "type num_chroms user system elapsed percent_cpu N T S" > results.txt
for t in single multi
do
    (for x in ${t}*.log; do echo ${x%.log} $(tail -n 2 $x | head -n 1 | cut -f 1-4 -d ' ') $N $T $S; done | sed -e 's/user//' | sed -e 's/chroms//' | sed -e 's/system//' | sed -e 's/elapsed//' | sed -e 's/.CPU//' | sort -n ) | tr ' ' '\t' >> results.txt
done
```
With that, I get the plot below: black points are single-threaded, red are multithreaded. Note that the simulations are still doing a fair bit of non-parallel work that is included in these timings:

[figure: elapsed time against number of chromosomes; black = single-threaded, red = multithreaded]
This is looking very good! The max CPU utilization percent reported is 675%. Here's the rest of the results:
```
type    num_chroms   user  system  elapsed  percent_cpu
multi            1   4.22    0.48     4.70           99
multi            2   8.50    1.02     5.36          177
multi            3  13.21    1.47     5.92          247
multi            4  18.17    2.01     6.52          309
multi            5  24.05    2.86     8.27          325
multi            6  30.71    3.70     9.02          381
multi            7  38.30    4.45     9.79          436
multi            8  46.23    5.26    10.56          487
multi            9  54.43    6.11    11.41          530
multi           10  63.40    7.06    12.16          579
multi           11  73.02    8.35    12.95          628
multi           12  83.72    9.45    13.79          675
single           1   3.97    0.05     4.02          100
single           2   7.91    0.16     8.08          100
single           3  12.31    0.61    12.93           99
single           4  16.69    0.82    17.52           99
single           5  21.33    1.14    22.47           99
single           6  25.93    1.39    27.33           99
single           7  30.78    1.79    32.57           99
single           8  35.85    2.18    38.04           99
single           9  40.84    1.96    42.81           99
single          10  45.17    2.69    47.87           99
single          11  50.77    3.11    53.89           99
single          12  55.90    3.41    59.32           99
```
Aha! That's great @petrelharp, it really clarifies things :+1:
I'm thinking it's worth fixing up the multichrom example to merge in? @bhaller is enthusiastic about getting this into SLiM at some point. I think this will just require switching delete_rows over to keep_rows, and removing the extra file I put in?
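For reference, a minimal sketch of the keep_rows pattern (keep_edges_in_window and the windowing criterion are hypothetical; they just stand in for whatever row filter the example needs):

```c
#include <stdlib.h>
#include <tskit/tables.h>

/* Keep only the edges intersecting [left, right), dropping the rest.
 * tsk_edge_table_keep_rows takes a boolean mask with one entry per row. */
static int
keep_edges_in_window(tsk_edge_table_t *edges, double left, double right)
{
    int ret;
    tsk_size_t j;
    tsk_bool_t *keep = malloc(edges->num_rows * sizeof(*keep));

    if (keep == NULL) {
        return TSK_ERR_NO_MEMORY;
    }
    for (j = 0; j < edges->num_rows; j++) {
        keep[j] = edges->left[j] < right && edges->right[j] > left;
    }
    ret = tsk_edge_table_keep_rows(edges, keep, 0, NULL);
    free(keep);
    return ret;
}
```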
OK, I'll fix it up when I get a chance.
I've updated this so that it's a clean build against upstream now @petrelharp. I think it needs a bit of documentation though - I don't really follow what the algorithm is doing. Could you update it with some comments to help clarify what the model is, etc.?
I think the docs build failure is because of a stale cache. The simplest thing is probably to create a new PR. I've squashed my original commits down to one.
I'm away on vacation for a couple of weeks, but:
- The speedup should be 75-80% at most if one doesn't do one "chromosome" per thread. That's what I showed with my prototype when I presented this method.
- @petrelharp -- if I understand your figure, 12 chromosomes means 12 threads and a 6x speedup? Is that correct? If you run perf, you should be able to quantify the % of time that is purely single-threaded, which will give a better sense of how efficient the threading is.
Hi @benjeffery and @jeromekelleher. Do you really want to close this? It remains a useful example; I'm going to be consulting it carefully in the multichromosome work I'm doing in SLiM right now, in fact, and I can imagine other folks finding it interesting and useful too. I'm not saying you need to merge it necessarily, now or perhaps ever, but if it's closed, it seems to me that a fair bit of very useful knowledge and discussion will become much harder to find. Reopening so this comment doesn't get lost, but if you really want to close it, I won't get into a close/reopen war with you. :->
It's fairly esoteric knowledge, @bhaller, which I'm not sure is worth keeping a PR open in perpetuity for. We've learned the key lessons from it, I think, and moved on?
But anybody who wants to actually use tskit in a multithreaded manner (which would be me, and very probably others, sooner or later) will continue to find it useful. Which is the point of an example, isn't it? Why purge it from the knowledge pool?
If it's useful we should merge it I guess - I didn't think too hard about it tbh. Keeping it here in limbo isn't an answer.
Does anyone want to have a look over and make suggestions for what would be needed to make this a useful documentation example?
I nominate @petrelharp :->. In all seriousness, he knows the tskit ecosystem way better than I do, knows your doc system, and has been involved in this all along. Can I volunteer you, Peter? :->
I think the proposal is to add something to the docs, here. I could do that, sure. I'm imagining just talking through, like "we can have the tables for each chromosome share the same node and individual tables; here's how we do that".
I can finish it up too @petrelharp, if you'd prefer to have a look through and post some review comments.
Go for it!
This code doesn't reset the sample flags; I've added a bit that does that, and also implemented one of @bhaller's suggestions. There's one more suggestion for which I don't know where the balance between 'code complexity' and 'safe coding' lies; let's get that sorted (or not) and merge this?
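For context, the fix is along these lines (a sketch of the idea, not the committed code; reset_sample_flags is a hypothetical name). With TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS in use, the caller owns the TSK_NODE_IS_SAMPLE flags, so after each generation we clear them everywhere and set them on the current samples:

```c
#include <tskit/tables.h>

/* Clear TSK_NODE_IS_SAMPLE on every node, then set it on the nodes in
 * `samples` (here, the final generation). Needed because simplify was
 * called with TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS. */
static void
reset_sample_flags(tsk_node_table_t *nodes, const tsk_id_t *samples,
    tsk_size_t num_samples)
{
    tsk_size_t j;

    for (j = 0; j < nodes->num_rows; j++) {
        nodes->flags[j] &= ~TSK_NODE_IS_SAMPLE;
    }
    for (j = 0; j < num_samples; j++) {
        nodes->flags[samples[j]] |= TSK_NODE_IS_SAMPLE;
    }
}
```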
Okay - fixed those comments up. I'm not sure why the builds are failing?