msprime icon indicating copy to clipboard operation
msprime copied to clipboard

segmentation fault when trying to run msprime.log_arg_likelihood()

Open savitakartik opened this issue 1 year ago • 3 comments

import tskit
import msprime
import random

random.seed(6)
nreps = 20
ne = [10, 1e2, 1e3, 1e4, 3e4, 5e4, 7e4, 1e5]

for i in range(nreps):
        ifile = "slim_msp_rep_" + str(i+1) + ".trees"
        ts = tskit.load(ifile)
        for j in range(len(ne)):
                print(msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne = ne[j]))

This code results in a Segmentation fault more often than not. The tree sequences are from a single-locus SLiM model without recombination, followed by recapitation of the deep past with msprime. I use a random seed of 6 with SLiM and msprime. I'm not sure how to reproduce the error as the code runs fine for the same tree sequence on occasion.

savitakartik avatar Sep 13 '22 10:09 savitakartik

Thanks for the bug report @savitakartik.

It looks like this will fail for a specific tree sequence file that you have stored - could you narrow it down to one file that definitely does result in a segfault, at least some of the time, please?

The file shouldn't be too big, so hopefully you can just attach it here.

jeromekelleher avatar Sep 13 '22 10:09 jeromekelleher

This is speculation without seeing the details, but since part of the tree comes from SLiM I'm guessing this will have something to do with simultaneous mergers. I wouldn't be surprised if those could cause the way the likelihood calculation traverses nodes and edges to go haywire. The likelihood is only really valid for tree sequences which are output by the Hudson coalescent with recombination algorithm; for anything else we should expect things to break. Even if the code runs, the output won't be very meaningful.

JereKoskela avatar Sep 16 '22 10:09 JereKoskela

That's true - still, we shouldn't segfault, so I'd like to get to the bottom of this.

jeromekelleher avatar Sep 16 '22 11:09 jeromekelleher

For example, this tree sequence file segfaults for Ne=10, 1000, 10000 slim_msp_rep_7.zip

savitakartik avatar Oct 11 '22 10:10 savitakartik

Hmm, I can't reproduce. I've run this code:

import msprime
import tskit
import tqdm

ts = tskit.load("slim_msp_rep_7.trees")
print(ts)

for _ in tqdm.tqdm(range(1000)):
    ll = msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne=1000)
print(ll)

Are you running this on the BMRC cluster? What version of msprime/Python etc?

jeromekelleher avatar Oct 11 '22 12:10 jeromekelleher

Sorry, wrong file - could you try this file but without the tqdm module, although I don't know why it would make a difference. I get a segmentation fault with:

import msprime
import tskit

ts = tskit.load("slim_rep_7.trees")
print(ts)

for i in range(1000):
        ll = msprime.log_arg_likelihood(ts, recombination_rate = 0, Ne=1000)
        print(i, ll)

but when I import tqdm (e.g. your code above), it works fine.

I'm using Python 3.10.5, msprime 1.2.0, tskit 0.5.2 on the BMRC cluster. slim_rep_7.zip

savitakartik avatar Oct 11 '22 14:10 savitakartik

Great, just reproduced:

$ gdb python3
(gdb) run segfault.py 
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, 
    r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
179                 edge = last_parent_edge[edges->child[edge]];
(gdb) bt
#0  0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, 
    r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
#1  0x00007ffff73be4db in msprime_log_likelihood_arg (self=<optimised out>, 
    args=<optimised out>, kwds=<optimised out>) at msprime/_msprimemodule.c:3198
#2  0x00000000005f6929 in PyCFunction_Call ()
#3  0x00000000005f74f6 in _PyObject_MakeTpCall ()
#4  0x000000000057164c in _PyEval_EvalFrameDefault ()
#5  0x0000000000569dba in _PyEval_EvalCodeWithName ()
#6  0x00000000005f6eb3 in _PyFunction_Vectorcall ()
#7  0x000000000056cc1f in _PyEval_EvalFrameDefault ()
#8  0x0000000000569dba in _PyEval_EvalCodeWithName ()
#9  0x00000000006902a7 in PyEval_EvalCode ()
#10 0x000000000067f951 in ?? ()
#11 0x000000000067f9cf in ?? ()
#12 0x000000000067fa71 in ?? ()
#13 0x0000000000681b97 in PyRun_SimpleFileExFlags ()
#14 0x00000000006b9d32 in Py_RunMain ()
#15 0x00000000006ba0bd in Py_BytesMain ()
#16 0x00007ffff7dd7083 in __libc_start_main (main=0x4efd60 <main>, argc=2, 
    argv=0x7fffffffdea8, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=0x7fffffffde98) at ../csu/libc-start.c:308
#17 0x00000000005fc5fe in _start ()

so the segfault happens here

Digging in:

(gdb) frame 0
#0  0x00007ffff73d5274 in msp_log_likelihood_arg (ts=ts@entry=0x7fffffffd590, r=0, Ne=1000, r_lik=r_lik@entry=0x7fffffffd588) at lib/likelihood.c:179
179                 edge = last_parent_edge[edges->child[edge]];
(gdb) print edge
$1 = 36521
(gdb) print edges->child[edge]
$2 = 246130
(gdb) print last_parent_edge[246130]
Cannot access memory at address 0x2853448

so, looks like we've run over in the last_parent_edge array. But hmm, there's only 36K nodes, so perhaps the child[edge] above was unset?

(gdb) print nodes->num_rows
$3 = 36522
$4 = 36522
(gdb) print edges->num_rows
$5 = 36521
(gdb) print edge
$6 = 36521

Ah, no, we've overrun the edge table. How did edge become equal to the number of rows in the table? Because we increment edge here and assume that it corresponds to a valid index.

So, we just need to check this, I think. It probably can't happen if we have binary trees.

@JereKoskela - should we fix this corner case and return a value, or should we check for binary trees in the input and return an error when the trees don't conform to our expectations?

jeromekelleher avatar Oct 12 '22 11:10 jeromekelleher

I think it would be cleanest to return an error when the tree isn't binary, since any number we return won't really mean much. Is there an easy, existing way to check that @jeromekelleher, or do you just mean adding a check to when we increment edge?

JereKoskela avatar Oct 12 '22 12:10 JereKoskela

Something like this in the Python code is probably best:

for tree in ts.trees():
     if np.any(tree.num_children_array > 2):
          raise ValueError(...)

Although should probably check for unary nodes etc as well.

jeromekelleher avatar Oct 12 '22 12:10 jeromekelleher

@jeromekelleher, any idea why python3 -m pytest would insist that Tree object has no attribute num_children_array?

Running git submodule update --init --recursive, then python -m pip install -r requirements/development.txt and make in the project root all complete without errors, and I'm parallel with the latest version of the main msprime repo.

JereKoskela avatar Oct 15 '22 10:10 JereKoskela

@JereKoskela what version of tskit are you using? Pip won't upgrade it unless asked and the dev requirements might need a pin. python -m pip install --upgrade tskit

benjeffery avatar Oct 15 '22 10:10 benjeffery

Upgrading tskit did the job. I had naively assumed reinstalling requirements would do that. Thanks @benjeffery!

JereKoskela avatar Oct 15 '22 10:10 JereKoskela

Sounds like we need a pin. Will raise an issue!

benjeffery avatar Oct 15 '22 11:10 benjeffery

Reopening this until we update the CHANGELOG

jeromekelleher avatar Oct 18 '22 12:10 jeromekelleher