msprime icon indicating copy to clipboard operation
msprime copied to clipboard

sim_mutations and garbage collection

Open mufernando opened this issue 2 years ago • 25 comments

I'm pretty sure sim_mutations is accumulating some garbage same as pyslim (see #pyslim211). Note how sim_mutations also calls ts.tables.

I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?

mufernando avatar Dec 10 '21 06:12 mufernando

There shouldn't be any interaction between ts.tables and garbage collection @mufernando --- it's just making a copy now where we want it to return a view.

jeromekelleher avatar Dec 10 '21 09:12 jeromekelleher

Here's an example:

import msprime

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

Then,

mprof run sim_mutations_mem.py
mprof plot -o mutations-mem.png

gives

mutations-mem

Looks pretty solid to me?

jeromekelleher avatar Dec 10 '21 09:12 jeromekelleher

Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example.

benjeffery avatar Dec 10 '21 11:12 benjeffery

I ran the same example (with a msprime tree sequence) and didn't run into issues.

mutations-mem

Then I ran a minimal example with three sim_mutation calls to one of my tree sequences (from SLiM, so with a ton of metadata everywhere). You can see the memory usage only grows over time.

mutations-slim

Lastly, I added some manual garbage collection in between sim_mutation calls and that seems to help quite a bit.

mutations-slim-gc

With some number of objects being accumulated in between calls:

Before: collected 0 objects.
After first call: collected 84 objects.
After second call: collected 84 objects.
After third call: collected 84 objects.

This was my script:

import tskit
import msprime
import gc

ts = tskit.load("../../output/E082AD5JLBXZ4U6/great-apes_E082AD5JLBXZ4U6_rep0.trees")

collected = gc.collect()
print("Before: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1, keep=False)
collected = gc.collect()
print("After first call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After second call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After third call: collected", collected, "objects.", flush=True)

and I can share the tree sequence, but it's 250M.

mufernando avatar Dec 10 '21 17:12 mufernando

ping @molpopgen?

petrelharp avatar Dec 10 '21 18:12 petrelharp

Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando.

molpopgen avatar Dec 10 '21 19:12 molpopgen

Very nice, thanks @mufernando

jeromekelleher avatar Dec 10 '21 19:12 jeromekelleher

import tskit
import msprime

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

mutations-mem

Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment.

molpopgen avatar Dec 10 '21 19:12 molpopgen

For completeness, I repeated my example w/gc.collect() after each bout of mutating.

mutations-mem2

molpopgen avatar Dec 10 '21 19:12 molpopgen

is this call in sim_mutations the issue?

https://github.com/tskit-dev/msprime/blob/c034a5d24a39ad888a603a2c2042298ad0f8265f/msprime/mutations.py#L1369

andrewkern avatar Dec 10 '21 19:12 andrewkern

The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue?

molpopgen avatar Dec 11 '21 03:12 molpopgen

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

jeromekelleher avatar Dec 11 '21 06:12 jeromekelleher

I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step.

jeromekelleher avatar Dec 11 '21 06:12 jeromekelleher

Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!

Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs.

molpopgen avatar Dec 11 '21 08:12 molpopgen

Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile:

ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

jeromekelleher avatar Dec 11 '21 20:12 jeromekelleher

These experiments also gave a flat profile:

import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()

# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
#    tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})

muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
    muts.add_row(
        site=m.site,
        node=m.node,
        derived_state=m.derived_state,
        parent=m.parent,
        time=m.time,
        metadata={"x": [1, 2, 3, 4]},
    )

tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
    site=muts.site,
    node=muts.node,
    derived_state=muts.derived_state,
    derived_state_offset=muts.derived_state_offset,
    parent=muts.parent,
    time=muts.time,
    metadata=muts.metadata,
    metadata_offset=muts.metadata_offset,
)

# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()

for j in range(500):
    mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)

molpopgen avatar Dec 11 '21 20:12 molpopgen

Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use.

molpopgen avatar Dec 11 '21 23:12 molpopgen

After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called ll_table, many copies of various mutation, individual, etc., tables. When I modified the script to gc.collect() right after load and then run:

mprof run mutatification.py |grep -c MutationTable

I got a count of 30. That seems way too many?

The script:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

for j in range(5):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

gc.collect()
for i in gc.garbage:
    print(i)

molpopgen avatar Dec 13 '21 23:12 molpopgen

This version gives a count of 6 mutation tables:

import gc
import sys
import tskit
import msprime

gc.set_debug(gc.DEBUG_SAVEALL)

ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")

gc.collect()

for j in range(1):
    mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)

del ts
del mts

gc.collect()
for i in gc.garbage:
    print(i)

molpopgen avatar Dec 13 '21 23:12 molpopgen

Hmmmmm....

jeromekelleher avatar Dec 14 '21 08:12 jeromekelleher

Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW.

molpopgen avatar Dec 14 '21 12:12 molpopgen

Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: https://github.com/tskit-dev/tskit/pull/2057

jeromekelleher avatar Dec 14 '21 13:12 jeromekelleher

I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one?

molpopgen avatar Dec 14 '21 14:12 molpopgen

No but I figure if gc finds things in simple cases like this that's evidence they exist?

jeromekelleher avatar Dec 14 '21 15:12 jeromekelleher

I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess.

molpopgen avatar Dec 14 '21 15:12 molpopgen