tsinfer
lmdb out of memory error on tutorial example
Hi, I keep getting an out of memory error even when running the toy example in the tutorial (under the msprime simulate section). I can run the simulation step and produce simulation-source.trees (8.2 MiB on disk). However, I get an out of memory error when I try to load the trees into a SampleData object ... the resulting file is 1 TB. I am running tsinfer 0.1.4 and python 3.7. I installed using pip (python -m pip install tsinfer) and can import msprime, tsinfer and tskit without issue (install log below). thanks, @stsmall
$ tsinfer ls simulation-source.trees
path      = simulation-source.trees
size      = 8.2 MiB
edges     = 158227
trees     = 36734
sites     = 39001
mutations = 39001
### These are the specific commands from the tutorial that I am running:
progress = tqdm.tqdm(total=ts.num_sites)
with tsinfer.SampleData(
        path="simulation.samples", sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)
        progress.update()
progress.close()
### Error:
MemoryError                               Traceback (most recent call last)

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in __init__(self, sequence_length, **kwargs)
    702     def __init__(self, sequence_length=0, **kwargs):
    703
--> 704         super().__init__(**kwargs)
    705         self.data.attrs["sequence_length"] = float(sequence_length)
    706         chunks = self._chunk_size,

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in __init__(self, path, num_flush_threads, compressor, chunk_size)
    254         self.path = path
    255         if path is not None:
--> 256             store = self._new_lmdb_store()
    257             self.data = zarr.open_group(store=store, mode="w")
    258             self.data.attrs[FORMAT_NAME_KEY] = self.FORMAT_NAME

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in _new_lmdb_store(self)
    312         # The existence of a lock-file can confuse things, so delete it.
    313         remove_lmdb_lockfile(self.path)
--> 314         return zarr.LMDBStore(self.path, subdir=False)
    315
    316     @classmethod

~/anaconda3/envs/allel/lib/python3.7/site-packages/zarr/storage.py in __init__(self, path, buffers, **kwargs)
   1685
   1686         # open database
-> 1687         self.db = lmdb.open(path, **kwargs)
   1688
   1689         # store properties

MemoryError: simulation.samples: Cannot allocate memory
The install log:
anaconda3/envs/allel/bin/pip install tsinfer
Processing ./.cache/pip/wheels/82/bd/bb/899331434dc60a63213a134706592840cd00cfa0dfff21431e/tsinfer-0.1.4-cp37-cp37m-linux_x86_64.whl
Requirement already satisfied: numcodecs>=0.6 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.7.2)
Requirement already satisfied: attrs in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (20.2.0)
Requirement already satisfied: numpy in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (1.19.2)
Collecting daiquiri
Using cached daiquiri-2.1.1-py2.py3-none-any.whl (17 kB)
Requirement already satisfied: lmdb in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.96)
Requirement already satisfied: sortedcontainers in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (2.2.2)
Requirement already satisfied: tqdm in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (4.48.2)
Requirement already satisfied: six in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (1.15.0)
Collecting humanize
Downloading humanize-3.1.0-py3-none-any.whl (69 kB)
|████████████████████████████████| 69 kB 750 kB/s
Requirement already satisfied: msprime>=0.6.1 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.7.3)
Requirement already satisfied: zarr>=2.2 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (2.4.0)
Processing ./.cache/pip/wheels/23/dd/19/e4d469fda8630bd61842b747618daa93bc332131e5e6851154/python_json_logger-2.0.1-py34-none-any.whl
Requirement already satisfied: setuptools in ./anaconda3/envs/allel/lib/python3.7/site-packages (from humanize->tsinfer) (49.6.0.post20201009)
Requirement already satisfied: tskit in ./anaconda3/envs/allel/lib/python3.7/site-packages (from msprime>=0.6.1->tsinfer) (0.3.2)
Requirement already satisfied: asciitree in ./anaconda3/envs/allel/lib/python3.7/site-packages (from zarr>=2.2->tsinfer) (0.3.3)
Requirement already satisfied: fasteners in ./anaconda3/envs/allel/lib/python3.7/site-packages (from zarr>=2.2->tsinfer) (0.14.1)
Requirement already satisfied: jsonschema>=3.0.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (3.2.0)
Requirement already satisfied: svgwrite>=1.1.10 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (1.4)
Requirement already satisfied: h5py>=2.6.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (2.10.0)
Requirement already satisfied: monotonic>=0.1 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from fasteners->zarr>=2.2->tsinfer) (1.5)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in ./anaconda3/envs/allel/lib/python3.7/site-packages (from jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (2.0.0)
Requirement already satisfied: pyrsistent>=0.14.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (0.17.3)
Requirement already satisfied: zipp>=0.5 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (3.3.1)
Installing collected packages: python-json-logger, daiquiri, humanize, tsinfer
Successfully installed daiquiri-2.1.1 humanize-3.1.0 python-json-logger-2.0.1 tsinfer-0.1.4
What OS are you on?
Hi, sorry for the quick edit. I am working on a Linux machine. thanks, @stsmall
specifically:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
Windows sometimes has problems allocating the large sparse files used by Zarr (and your example seems to be failing right at the start, when allocating the memory, which is consistent with that). But Linux should have no problems, so that's weird. I assume you are using the correct ts (what do you get from ts.num_sites and ts.num_samples)?
By the way, can you do tsinfer.SampleData.from_tree_sequence(ts)?
Edit - the 1TB thing may be a red herring - the sparse files allocated by tsinfer appear to take up huge amounts of space, but I think they actually don't. How much disk space do you have available anyway? The final simulation.samples file is about 22MB on my machine.
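As a quick check (a minimal sketch; "simulation.samples" here stands for whatever path you passed), you can compare the file's apparent size with the disk space it actually occupies:

import os

# A sparse file can have a huge apparent size while using almost no disk.
st = os.stat("simulation.samples")
print("apparent size:", st.st_size)              # may report ~1 TiB for a sparse file
print("actual disk usage:", st.st_blocks * 512)  # st_blocks counts 512-byte blocks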
In [3]: ts.num_sites
Out[3]: 39001

In [4]: ts.num_samples
Out[4]: 10000
yes, loading the tree sequence seems to work

In [5]: tsinfer.SampleData.from_tree_sequence(ts)
Out[5]: <tsinfer.formats.SampleData at 0x7ff5aae16b90>
oddly, if I run this:

tsinfer.SampleData.from_tree_sequence(ts, path="simulation_samples", num_flush_threads=2)

then I get the memory error again
disk space ... 5T
Well, that's no problem then! I suspect @jeromekelleher (or maybe @benjeffery) might have more of a clue why it's failing. I seem to have lmdb version 0.99, which is more recent than yours - maybe worth doing python3 -m pip uninstall lmdb && python3 -m pip install lmdb to update to a newer one, if available (and maybe the same with zarr?)
OK, I will try that and let you know.
Successfully installed lmdb-1.0.0
Successfully installed zarr-2.5.0
Didn't seem to make a difference, and I am still getting the memory error.
seems to be specific to using the path keyword:

tsinfer.SampleData.from_tree_sequence(ts, num_flush_threads=2)  # no error
tsinfer.SampleData.from_tree_sequence(ts, path="samples", num_flush_threads=2)  # memory error
If I run the command-line version on the trees from msprime:
$ tsinfer infer simulation-source.trees
2020-10-20 15:35:37,373 [32391] CRITICAL root: Traceback (most recent call last):
  File "anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py", line 291, in _open_lmbd_readonly
    self.path, map_size=map_size, readonly=True, subdir=False, lock=False)
  File "anaconda3/envs/allel/lib/python3.7/site-packages/zarr/storage.py", line 1860, in __init__
    self.db = lmdb.open(path, **kwargs)
lmdb.InvalidError: simulation-source.trees: MDB_INVALID: File is not an LMDB file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "anaconda3/envs/allel/bin/tsinfer", line 8, in
> Didn't seem to make a difference, and I am still getting the memory error.
Thanks for trying with an updated lmdb.
> seems to be specific to using the path keyword
Yes, it will be when you use path, as that's when tsinfer allocates a file to save the SampleData instance - otherwise it just keeps it in memory.
> $ tsinfer infer simulation-source.trees
You can't run tsinfer infer on a .trees file (i.e. a tree sequence). You have to run it on a SampleData file (that's the thing that you are trying to create).
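For reference, the usual workflow is to build the SampleData first and then infer from it; here is a minimal sketch using the Python API (assuming ts is your simulated tree sequence):

import tsinfer

# Build the sample data (in memory here, to sidestep the path problem) ...
sample_data = tsinfer.SampleData.from_tree_sequence(ts)

# ... then run inference on it and save the resulting tree sequence.
inferred_ts = tsinfer.infer(sample_data)
inferred_ts.dump("simulation-inferred.trees")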
From a previous comment that you changed, it looks like the file might be being saved in /scratch365/blah/blah/blah/blah/ - I wonder if there's not enough space on that filesystem? When you give a path, can you provide a path to a file which you know will be saved where there is lots of space? It might also matter, I suppose, what the filesystem format is (e.g. NFS, ext3/4, etc.).
Sorry about that, I was trying everything :)
I edited to remove the identifying info.
The scratch space is where I have 5 TB of empty space. I have had some tar files in the 300-400 GB range there; maybe it's the system reading it as 1 TB that's the problem? I agree that I don't think the file is actually that big ...
Sorry - to clarify - can you run something like (substituting in the correct paths)
with tsinfer.SampleData(
        path="/absolute/path/to/capacious/filesystem/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
Or even
with tsinfer.SampleData(
        path="/absolute/path/to/local/filesystem/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
> The scratch space is where I have 5 TB of empty space. I have had some tar files in the 300-400 GB range there; maybe it's the system reading it as 1 TB that's the problem? I agree that I don't think the file is actually that big ...
a. Can you provide the path arg as an absolute reference to a file in that scratch space (path="/scratch365/blah/blah/blah/blah/simulation.samples")? I'm not sure where you are running the python commands from.
b. What's the filesystem used on the scratch space? I'm guessing it's some networked thing?
OK, I ran:
with tsinfer.SampleData(path="/scratch365/username/simulation.samples",
                        sequence_length=ts.sequence_length,
                        num_flush_threads=2) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)
Still the same error. I tried an absolute path as well. The file system is panfs; I also tried to write to another file system, afs, which returned the same error. Thanks for all your help btw!
> Thanks for all your help btw!
Sorry it's not been much use, though. I think it's probably some filesystem thing - maybe it doesn't like allocating 1TB sparse files on panfs or afs. You can run the demos without using the path arg, and if you need to actually save a file, I think you can just do my_memory_stored_sample_data.copy(path="new_path").
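So something like this should work (a rough sketch; the path is a placeholder, and I think you also need to call finalise() on the copy before using it):

# Build the SampleData entirely in memory (no path argument) ...
with tsinfer.SampleData(sequence_length=ts.sequence_length) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)

# ... then write it out to disk in one go.
saved = sample_data.copy(path="/path/with/lots/of/space/simulation.samples")
saved.finalise()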
OK, thanks, I will try that. The system has 500 GB of memory, so that shouldn't be a problem. I will just have to pre-apologize to the other users.
I tried on my laptop and it worked as expected, but it doesn't have enough memory (RAM) to run my real data. I will contact the sys admins about the issue. Let me know if @jeromekelleher or @benjeffery have an idea that we didn't try. thanks again for all your help! @stsmall
Oh, I forgot, you can also use the max_file_size argument. Could you check if this works:
with tsinfer.SampleData(
        path="/scratch365/username/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2,
        max_file_size=2**30,  # 1 GiB
) as sample_data:
We should note that this could be a problem on unusual file systems, I guess.
Hmm, it's telling me that

TypeError: __init__() got an unexpected keyword argument 'max_file_size'

I only see options of: path, sequence_length, num_flush_threads, compressor, chunk_size

In [9]: tsinfer.__version__
Out[9]: '0.1.4'
Ah, that's in the latest version, sorry (https://tsinfer.readthedocs.io/en/latest/api.html#file-formats). You can try the new version via python -m pip install git+https://github.com/tskit-dev/tsinfer if you want, but it sounds like you can work around the issue anyway.
Hi @hyanwong, passing 2**30 to the new version as max_file_size runs the msprime simulation example successfully. With my real data, will I need to estimate the final file size to set that option? thanks, @stsmall
Yes, I guess you'll need to experiment with what max_file_size works. There's a minor discussion in the Note section of the latest docs. If not on Windows, we set the default max size to 1 TB, assuming no-one is going to want bigger than that (the actual file gets shrunk to the required size once constructed). You can probably set max_file_size to the size of your VCF file as a first approximation.
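Something along these lines might work as a starting point (a rough sketch; mydata.vcf is a placeholder, and the loop here reuses the tutorial's ts.variants() loop where you'd put your real VCF-parsing code):

import os
import tsinfer

vcf_path = "mydata.vcf"  # placeholder for your real VCF

# Use the VCF size as a first approximation, but never less than 1 GiB.
max_size = max(os.path.getsize(vcf_path), 2**30)

with tsinfer.SampleData(path="mydata.samples", max_file_size=max_size) as sample_data:
    for var in ts.variants():  # replace with your VCF-parsing loop
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)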
OK, great. Thanks!
P.s. if you are reading in large VCF files, then there's some demo code to do so in parallel at https://github.com/tskit-dev/tsinfer/issues/277#issuecomment-652024871
Oh, another thing @stsmall - I wonder if the latest tsinfer version works without the max_file_size argument anyway? Perhaps the older version asked for indefinitely large files, or something, which panfs didn't like?
No, with the newer version I get the same 'Cannot allocate memory' error. It works if I use the max_file_size option.
Can you normally create a 1TB file on that scratch space? Or are there e.g. limits on the size of files that users can create?
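A quick way to test that directly (a minimal sketch; the path is a placeholder for your scratch space):

import os

# Try to allocate a 1 TiB sparse file; if the filesystem (or a quota)
# refuses files this large, truncate() should raise an OSError.
test_path = "/scratch365/username/sparse_test"
with open(test_path, "wb") as f:
    f.truncate(2**40)  # 1 TiB
print("allocated", os.stat(test_path).st_size, "bytes (apparent)")
os.remove(test_path)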
I'll probably close this issue and open up another that has a more meaningful name, if that's OK.