msprime
msprime copied to clipboard
sim_mutations and garbage collection
I'm pretty sure sim_mutations
is accumulating some garbage same as pyslim (see #pyslim211). Note how sim_mutations
also calls ts.tables
.
I don't have the bandwidth right now to make a proper test... But it seems this won't matter after the changes to tskit's memory usage goes in, right @benjeffery?
There shouldn't be any interaction between ts.tables
and garbage collection @mufernando --- it's just making a copy now where we want it to return a view.
Here's an example:
import msprime
ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
for j in range(500):
mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)
Then,
mprof run sim_mutations_mem.py
mprof plot -o mutations-mem.png
gives
Looks pretty solid to me?
Are you getting the same result @mufernando? Would be interesting of your system is accumulating for this example.
I ran the same example (with a msprime tree sequence) and didn't run into issues.
Then I ran a minimal example with three sim_mutation
calls to one of my tree sequences (from SLiM, so with a ton of metadata everywhere). You can see the memory usage only grows over time.
Lastly, I added some manual garbage collection in between sim_mutation
calls and that seems to help quite a bit.
With some number of objects being accumulated in between calls:
Before: collected 0 objects.
After first call: collected 84 objects.
After second call: collected 84 objects.
After third call: collected 84 objects.
This was my script:
import tskit
import msprime
import gc
ts = tskit.load("../../output/E082AD5JLBXZ4U6/great-apes_E082AD5JLBXZ4U6_rep0.trees")
collected = gc.collect()
print("Before: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1, keep=False)
collected = gc.collect()
print("After first call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After second call: collected", collected, "objects.", flush=True)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=2, keep=False)
collected = gc.collect()
print("After third call: collected", collected, "objects.", flush=True)
and I can share the tree sequence, but it's 250M.
ping @molpopgen?
Well, let me try to break a tie and see if I make a plot that looks more like @jeromekelleher or more like @mufernando.
Very nice, thanks @mufernando
import tskit
import msprime
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
for j in range(5):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)
Tie goes to @mufernando, I think? Did this with a file I'm working with at the moment.
For completeness, I repeated my example w/gc.collect()
after each bout of mutating.
is this call in sim_mutations
the issue?
https://github.com/tskit-dev/msprime/blob/c034a5d24a39ad888a603a2c2042298ad0f8265f/msprime/mutations.py#L1369
The main difference between @jeromekelleher's example and the others seems to be the presence of metadata in the latter, which must be a clue?
Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!
I agree with Kevin, it's certainly pointing towards the presence of metadata. I guess making a smaller reproducible example is the next step.
Just looking at your plot again @mufernando - is that really 20s to load a 250mb file? Damn that's slow!
Hmm. Yeah. In mine, it is 14 sec for 1.5 gigs.
Well, adding some top-level metadata doesn't do anything. This example has a flat memory profile:
ts = msprime.sim_ancestry(10**4, sequence_length=1000, random_seed=1234)
tables = ts.dump_tables()
tables.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()
for j in range(500):
mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)
These experiments also gave a flat profile:
import msprime
import numpy as np
import tskit
ts = msprime.sim_ancestry(10 ** 4, sequence_length=1000, random_seed=1234)
ts = msprime.sim_mutations(ts, rate=1, random_seed=1)
tables = ts.dump_tables()
# This gave a flat profile
# tables.individuals.clear()
# ind = np.where(tables.nodes.individual != tskit.NULL)[0]
#
# tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
# for i, j in zip(ind, ind[1:]):
# tables.individuals.add_row(metadata={"nodes": [int(i), int(j)]})
muts = tskit.MutationTable()
muts.metadata_schema = tskit.MetadataSchema.permissive_json()
for m in tables.mutations:
muts.add_row(
site=m.site,
node=m.node,
derived_state=m.derived_state,
parent=m.parent,
time=m.time,
metadata={"x": [1, 2, 3, 4]},
)
tables.mutations.metadata_schema = muts.metadata_schema
tables.mutations.set_columns(
site=muts.site,
node=muts.node,
derived_state=muts.derived_state,
derived_state_offset=muts.derived_state_offset,
parent=muts.parent,
time=muts.time,
metadata=muts.metadata,
metadata_offset=muts.metadata_offset,
)
# tables.metadata_schema = tskit.MetadataSchema.permissive_json()
# tables.metadata = {j: list(range(100)) for j in range(100)}
ts = tables.tree_sequence()
for j in range(500):
mts = msprime.sim_mutations(ts, rate=1, random_seed=j + 1)
Note to self to try and use the methods outlined here to try to find reference cycles in the examples leading to growing memory use.
After changing my script to what is shown below, I got an absolutely massive log of stuff. The log is also large if I skip the mutation step. However, the difference between the 2 is lots of variables called ll_table
, many copies of various mutation, individual, etc., tables. When I modified the script to gc.collect()
right after load and then run:
mprof run mutatification.py |grep -c MutationTable
I got a count of 30. That seems way too many?
The script:
import gc
import sys
import tskit
import msprime
gc.set_debug(gc.DEBUG_SAVEALL)
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
for j in range(5):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)
gc.collect()
for i in gc.garbage:
print(i)
This version gives a count of 6 mutation tables:
import gc
import sys
import tskit
import msprime
gc.set_debug(gc.DEBUG_SAVEALL)
ts = tskit.load("treefile_even_larger_effect_equilibrium_higmu.trees")
gc.collect()
for j in range(1):
mts = msprime.sim_mutations(ts, rate=1e-9, random_seed=j + 1)
del ts
del mts
gc.collect()
for i in gc.garbage:
print(i)
Hmmmmm....
Yeah, my intuition is no good on this. I think this is with tskit 0.3.7 installed, FWIW.
Thanks, this is very helpful. I'll dig in when I get a chance. Identifying if we have a refcount cycle will be very helpful and motivating in the read-only tables work: https://github.com/tskit-dev/tskit/pull/2057
I haven't had luck finding a runtime analysis tool that'll print out reference cycles. Do you know of one?
No but I figure if gc finds things in simple cases like this that's evidence they exist?
I believe that's correct. But I want a back trace analog here. I want to see where they're coming from. But I'd also like a unicorn I guess.