msprime sim_mutations and garbage collection

sim_mutations and garbage collection

Open mufernando opened this issue 2 years ago • 25 comments

I'm pretty sure sim_mutations is accumulating some garbage same as pyslim (see #pyslim211). Note how sim_mutations also calls ts.tables.

I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?

Dec 10 '21 06:12 mufernando

There shouldn't be any interaction between ts.tables and garbage collection @mufernando --- it's just making a copy now where we want it to return a view.

Dec 10 '21 09:12 jeromekelleher

Here's an example:

import msprime

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Then,

mprof run sim_mutations_mem.py
mprof plot -o mutations-mem.png

gives

mutations-mem

Looks pretty solid to me?

Dec 10 '21 09:12 jeromekelleher

Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example.

Dec 10 '21 11:12 benjeffery

I ran the same example (with a msprime tree sequence) and didn't run into issues.

mutations-mem

Then I ran a minimal example with three sim_mutation calls to one of my tree sequences (from SLiM, so with a ton of metadata everywhere). You can see the memory usage only grows over time.

mutations-slim

Lastly, I added some manual garbage collection in between sim_mutation calls and that seems to help quite a bit.

mutations-slim-gc

With some number of objects being accumulated in between calls:

Before: collected 0 objects.
After first call: collected 84 objects.
After second call: collected 84 objects.
After third call: collected 84 objects.

This was my script:

import tskit
import msprime
import gc

ts = tskit.load("../../output/E082AD5JLBXZ4U6/great-apes_E082AD5JLBXZ4U6_rep0.trees")

collected = gc.collect()
print("Before: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1, keep=False)
collected = gc.collect()
print("After first call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After second call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After third call: collected", collected, "objects.", flush=True)

and I can share the tree sequence, but it's 250M.

Dec 10 '21 17:12 mufernando

ping @molpopgen?

Dec 10 '21 18:12 petrelharp

Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando.

Dec 10 '21 19:12 molpopgen

Very nice, thanks @mufernando

Dec 10 '21 19:12 jeromekelleher

import tskit
import msprime

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

mutations-mem

Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment.

Dec 10 '21 19:12 molpopgen

For completeness, I repeated my example w/gc.collect() after each bout of mutating.

mutations-mem2

Dec 10 '21 19:12 molpopgen

is this call in sim_mutations the issue?

https://github.com/tskit-dev/msprime/blob/c034a5d24a39ad888a603a2c2042298ad0f8265f/msprime/mutations.py#L1369

Dec 10 '21 19:12 andrewkern

The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue?

Dec 11 '21 03:12 molpopgen

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

Dec 11 '21 06:12 jeromekelleher

I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step.

Dec 11 '21 06:12 jeromekelleher

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs.

Dec 11 '21 08:12 molpopgen

Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile:

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Dec 11 '21 20:12 jeromekelleher

These experiments also gave a flat profile:

import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()

# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
#    tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})

muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
    muts.add_row(
        site=m.site,
        node=m.node,
        derived_state=m.derived_state,
        parent=m.parent,
        time=m.time,
        metadata={"x": [1, 2, 3, 4]},
    )

tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
    site=muts.site,
    node=muts.node,
    derived_state=muts.derived_state,
    derived_state_offset=muts.derived_state_offset,
    parent=muts.parent,
    time=muts.time,
    metadata=muts.metadata,
    metadata_offset=muts.metadata_offset,
)

# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Dec 11 '21 20:12 molpopgen

Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use.

Dec 11 '21 23:12 molpopgen

After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called ll_table, many copies of various mutation, individual, etc., tables. When I modified the script to gc.collect() right after load and then run:

mprof run mutatification.py |grep -c MutationTable

I got a count of 30. That seems way too many?

The script:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

gc.collect()
for i in gc.garbage:
    print(i)

Dec 13 '21 23:12 molpopgen

This version gives a count of 6 mutation tables:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

gc.collect()

for j in range(1):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

del ts
del mts

gc.collect()
for i in gc.garbage:
    print(i)

Dec 13 '21 23:12 molpopgen

Hmmmmm....

Dec 14 '21 08:12 jeromekelleher

Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW.

Dec 14 '21 12:12 molpopgen

Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: https://github.com/tskit-dev/tskit/pull/2057

Dec 14 '21 13:12 jeromekelleher

I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one?

Dec 14 '21 14:12 molpopgen

No but I figure if gc finds things in simple cases like this that's evidence they exist?

Dec 14 '21 15:12 jeromekelleher

I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess.

Dec 14 '21 15:12 molpopgen

msprime msprime copied to clipboard

sim_mutations and garbage collection

msprime
msprime copied to clipboard