tsinfer
lmdb out of memory error on tutorial example
Hi, I keep getting an out of memory error even when running the toy example in the tutorial (under the msprime simulate section). I can run the simulation step and produce simulation-source.trees (8.2 MiB on disk). However, I get an out of memory error when I try to load the trees into a SampleData object ... the resulting file is 1 TB. I am running tsinfer 0.1.4 and python 3.7. I installed using pip (python -m pip install tsinfer) and can import msprime, tsinfer and tskit without issue (install log below). thanks, @stsmall
$ tsinfer ls simulation-source.trees
path      = simulation-source.trees
size      = 8.2 MiB
edges     = 158227
trees     = 36734
sites     = 39001
mutations = 39001
### These are the specific commands from the tutorial that I am running:
progress = tqdm.tqdm(total=ts.num_sites)
with tsinfer.SampleData(
        path="simulation.samples", sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)
        progress.update()
progress.close()
### Error:
MemoryError                               Traceback (most recent call last)

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in __init__(self, sequence_length, **kwargs)
    702     def __init__(self, sequence_length=0, **kwargs):
    703
--> 704         super().__init__(**kwargs)
    705         self.data.attrs["sequence_length"] = float(sequence_length)
    706         chunks = self._chunk_size,

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in __init__(self, path, num_flush_threads, compressor, chunk_size)
    254         self.path = path
    255         if path is not None:
--> 256             store = self._new_lmdb_store()
    257             self.data = zarr.open_group(store=store, mode="w")
    258             self.data.attrs[FORMAT_NAME_KEY] = self.FORMAT_NAME

~/anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py in _new_lmdb_store(self)
    312         # The existence of a lock-file can confuse things, so delete it.
    313         remove_lmdb_lockfile(self.path)
--> 314         return zarr.LMDBStore(self.path, subdir=False)
    315
    316     @classmethod

~/anaconda3/envs/allel/lib/python3.7/site-packages/zarr/storage.py in __init__(self, path, buffers, **kwargs)
   1685
   1686         # open database
-> 1687         self.db = lmdb.open(path, **kwargs)
   1688
   1689         # store properties

MemoryError: simulation.samples: Cannot allocate memory
The install log:
anaconda3/envs/allel/bin/pip install tsinfer
Processing ./.cache/pip/wheels/82/bd/bb/899331434dc60a63213a134706592840cd00cfa0dfff21431e/tsinfer-0.1.4-cp37-cp37m-linux_x86_64.whl
Requirement already satisfied: numcodecs>=0.6 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.7.2)
Requirement already satisfied: attrs in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (20.2.0)
Requirement already satisfied: numpy in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (1.19.2)
Collecting daiquiri
Using cached daiquiri-2.1.1-py2.py3-none-any.whl (17 kB)
Requirement already satisfied: lmdb in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.96)
Requirement already satisfied: sortedcontainers in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (2.2.2)
Requirement already satisfied: tqdm in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (4.48.2)
Requirement already satisfied: six in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (1.15.0)
Collecting humanize
Downloading humanize-3.1.0-py3-none-any.whl (69 kB)
|████████████████████████████████| 69 kB 750 kB/s
Requirement already satisfied: msprime>=0.6.1 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (0.7.3)
Requirement already satisfied: zarr>=2.2 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tsinfer) (2.4.0)
Processing ./.cache/pip/wheels/23/dd/19/e4d469fda8630bd61842b747618daa93bc332131e5e6851154/python_json_logger-2.0.1-py34-none-any.whl
Requirement already satisfied: setuptools in ./anaconda3/envs/allel/lib/python3.7/site-packages (from humanize->tsinfer) (49.6.0.post20201009)
Requirement already satisfied: tskit in ./anaconda3/envs/allel/lib/python3.7/site-packages (from msprime>=0.6.1->tsinfer) (0.3.2)
Requirement already satisfied: asciitree in ./anaconda3/envs/allel/lib/python3.7/site-packages (from zarr>=2.2->tsinfer) (0.3.3)
Requirement already satisfied: fasteners in ./anaconda3/envs/allel/lib/python3.7/site-packages (from zarr>=2.2->tsinfer) (0.14.1)
Requirement already satisfied: jsonschema>=3.0.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (3.2.0)
Requirement already satisfied: svgwrite>=1.1.10 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (1.4)
Requirement already satisfied: h5py>=2.6.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from tskit->msprime>=0.6.1->tsinfer) (2.10.0)
Requirement already satisfied: monotonic>=0.1 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from fasteners->zarr>=2.2->tsinfer) (1.5)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in ./anaconda3/envs/allel/lib/python3.7/site-packages (from jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (2.0.0)
Requirement already satisfied: pyrsistent>=0.14.0 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (0.17.3)
Requirement already satisfied: zipp>=0.5 in ./anaconda3/envs/allel/lib/python3.7/site-packages (from importlib-metadata; python_version < "3.8"->jsonschema>=3.0.0->tskit->msprime>=0.6.1->tsinfer) (3.3.1)
Installing collected packages: python-json-logger, daiquiri, humanize, tsinfer
Successfully installed daiquiri-2.1.1 humanize-3.1.0 python-json-logger-2.0.1 tsinfer-0.1.4
What OS are you on?
Hi, sorry for the quick edit. I am working on a Linux machine. thanks, @stsmall
specifically:
NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
Windows sometimes has problems allocating the large sparse files used by Zarr (and your example seems to be failing right at the start, when allocating the memory, which is consistent with that). But Linux should have no problems, so that's weird. I assume you are using the correct ts (what do you get from ts.num_sites and ts.num_samples)?
By the way, can you do tsinfer.SampleData.from_tree_sequence(ts)?
Edit - the 1TB thing may be a red herring - the sparse files allocated by tsinfer appear to take up huge amounts of space, but I think they actually don't. How much disk space do you have available anyway? The final simulation.samples file is about 22MB on my machine.
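As a quick check (a minimal sketch; "simulation.samples" here stands for whatever path you passed), you can compare the file's apparent size with the disk space it actually occupies:

import os

# A sparse file can have a huge apparent size while using almost no disk.
st = os.stat("simulation.samples")
print("apparent size:", st.st_size)              # may report ~1 TiB for a sparse file
print("actual disk usage:", st.st_blocks * 512)  # st_blocks counts 512-byte blocks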
In [3]: ts.num_sites
Out[3]: 39001

In [4]: ts.num_samples
Out[4]: 10000
yes, loading the tree sequence seems to work

In [5]: tsinfer.SampleData.from_tree_sequence(ts)
Out[5]: <tsinfer.formats.SampleData at 0x7ff5aae16b90>
oddly, if I run this:

tsinfer.SampleData.from_tree_sequence(ts, path="simulation_samples", num_flush_threads=2)

then I get the memory error again
disk space ... 5T
Well, that's no problem then! I suspect @jeromekelleher (or maybe @benjeffery) might have more of a clue why it's failing. I seem to have lmdb version 0.99, which is more recent than yours - maybe worth doing python3 -m pip uninstall lmdb && python3 -m pip install lmdb to update to a newer one, if available (and maybe the same with zarr?)
OK, I will try that and let you know.
Successfully installed lmdb-1.0.0
Successfully installed zarr-2.5.0
Didn't seem to make a difference, and I am still getting the memory error.
seems to be specific to using the path keyword:

tsinfer.SampleData.from_tree_sequence(ts, num_flush_threads=2)  # no error
tsinfer.SampleData.from_tree_sequence(ts, path="samples", num_flush_threads=2)  # memory error
If I run the command-line version on the trees from msprime:
$ tsinfer infer simulation-source.trees
2020-10-20 15:35:37,373 [32391] CRITICAL root: Traceback (most recent call last):
  File "anaconda3/envs/allel/lib/python3.7/site-packages/tsinfer/formats.py", line 291, in _open_lmbd_readonly
    self.path, map_size=map_size, readonly=True, subdir=False, lock=False)
  File "anaconda3/envs/allel/lib/python3.7/site-packages/zarr/storage.py", line 1860, in __init__
    self.db = lmdb.open(path, **kwargs)
lmdb.InvalidError: simulation-source.trees: MDB_INVALID: File is not an LMDB file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "anaconda3/envs/allel/bin/tsinfer", line 8, in
> Didn't seem to make a difference, and I am still getting the memory error.
Thanks for trying with an updated lmdb.
> seems to be specific to using the path keyword
Yes, it will be when you use path, as that's when tsinfer allocates a file to save the SampleData instance - otherwise it just keeps it in memory.
> $ tsinfer infer simulation-source.trees
You can't run tsinfer infer on a .trees file (i.e. a tree sequence). You have to run it on a SampleData file (that's the thing that you are trying to create).
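For reference, the usual workflow is to build the SampleData first and then infer from it; here is a minimal sketch using the Python API (assuming ts is your simulated tree sequence):

import tsinfer

# Build the sample data (in memory here, to sidestep the path problem) ...
sample_data = tsinfer.SampleData.from_tree_sequence(ts)

# ... then run inference on it and save the resulting tree sequence.
inferred_ts = tsinfer.infer(sample_data)
inferred_ts.dump("simulation-inferred.trees")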
From a previous comment that you changed, it looks like the file might be being saved in /scratch365/blah/blah/blah/blah/ - I wonder if there's not enough space on that filesystem? When you give a path, can you provide a path to a file which you know will be saved where there is lots of space? It might also matter, I suppose, what the filesystem format is (e.g. NFS, ext3/4, etc.).
Sorry about that, I was trying everything :)
I edited to remove the identifying info.
The scratch space is where I have 5 TB of empty space. I have had some tar files in the 300-400 GB range there; maybe it's the system reading it as 1 TB that's the problem? I agree that I don't think the file is actually that big ...
Sorry - to clarify - can you run something like (substituting in the correct paths)
with tsinfer.SampleData(
        path="/absolute/path/to/capacious/filesystem/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
Or even
with tsinfer.SampleData(
        path="/absolute/path/to/local/filesystem/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2) as sample_data:
> The scratch space is where I have 5 TB of empty space. I have had some tar files in the 300-400 GB range there; maybe it's the system reading it as 1 TB that's the problem? I agree that I don't think the file is actually that big ...
a. Can you provide the path arg as an absolute reference to a file in that scratch space (path="/scratch365/blah/blah/blah/blah/simulation.samples")? I'm not sure where you are running the python commands from.
b. What's the filesystem used on the scratch space? I'm guessing it's some networked thing?
OK, I ran:
with tsinfer.SampleData(path="/scratch365/username/simulation.samples",
                        sequence_length=ts.sequence_length,
                        num_flush_threads=2) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)
Still the same error. I tried an absolute path as well. The file system is panfs; I also tried to write to another file system, afs, which returned the same error. Thanks for all your help btw!
> Thanks for all your help btw!
Sorry it's not been much use, though. I think it's probably some filesystem thing - maybe it doesn't like allocating 1TB sparse files on panfs or afs. You can run the demos without using the path arg, and if you need to actually save a file, I think you can just do my_memory_stored_sample_data.copy(path="new_path").
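So something like this should work (a rough sketch; the path is a placeholder, and I think you also need to call finalise() on the copy before using it):

# Build the SampleData entirely in memory (no path argument) ...
with tsinfer.SampleData(sequence_length=ts.sequence_length) as sample_data:
    for var in ts.variants():
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)

# ... then write it out to disk in one go.
saved = sample_data.copy(path="/path/with/lots/of/space/simulation.samples")
saved.finalise()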
OK, thanks, I will try that. The system has 500 GB of memory, so that shouldn't be a problem. I will just have to pre-apologize to the other users.
I tried on my laptop and it worked as expected, but it doesn't have enough memory (RAM) to run my real data. I will contact the sys admins about the issue. Let me know if @jeromekelleher or @benjeffery have an idea that we didn't try. thanks again for all your help! @stsmall
Oh, I forgot, you can also use the max_file_size argument. Could you check if this works:
with tsinfer.SampleData(
        path="/scratch365/username/simulation.samples",
        sequence_length=ts.sequence_length,
        num_flush_threads=2,
        max_file_size=2**30,  # 1 GiB
) as sample_data:
We should note that this could be a problem on unusual file systems, I guess.
Hmm, it's telling me that

TypeError: __init__() got an unexpected keyword argument 'max_file_size'

I only see options of: path, sequence_length, num_flush_threads, compressor, chunk_size

In [9]: tsinfer.__version__
Out[9]: '0.1.4'
Ah, that's in the latest version, sorry (https://tsinfer.readthedocs.io/en/latest/api.html#file-formats). You can try the new version via python -m pip install git+https://github.com/tskit-dev/tsinfer if you want, but it sounds like you can work around the issue anyway.
Hi @hyanwong, passing 2**30 to the new version as max_file_size runs the msprime simulation example successfully. With my real data, will I need to estimate the final file size to set that option? thanks, @stsmall
Yes, I guess you'll need to experiment with what max_file_size works. There's a minor discussion in the Note section of the latest docs. If not on Windows, we set the default max size to 1 TB, assuming no-one is going to want bigger than that (the actual file gets shrunk to the required size once constructed). You can probably set max_file_size to the size of your VCF file as a first approximation.
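Something along these lines might work as a starting point (a rough sketch; mydata.vcf is a placeholder, and the loop here reuses the tutorial's ts.variants() loop where you'd put your real VCF-parsing code):

import os
import tsinfer

vcf_path = "mydata.vcf"  # placeholder for your real VCF

# Use the VCF size as a first approximation, but never less than 1 GiB.
max_size = max(os.path.getsize(vcf_path), 2**30)

with tsinfer.SampleData(path="mydata.samples", max_file_size=max_size) as sample_data:
    for var in ts.variants():  # replace with your VCF-parsing loop
        sample_data.add_site(var.site.position, var.genotypes, var.alleles)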
OK, great. Thanks!
P.s. if you are reading in large VCF files, then there's some demo code to do so in parallel at https://github.com/tskit-dev/tsinfer/issues/277#issuecomment-652024871
Oh, another thing @stsmall - I wonder if the latest tsinfer version works without the max_file_size argument anyway? Perhaps the older version asked for indefinitely large files, or something, which panfs didn't like?
No, with the newer version I get the same 'Cannot allocate memory' error. It works if I use the max_file_size option.
Can you normally create a 1TB file on that scratch space? Or are there e.g. limits on the size of files that users can create?
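A quick way to test that directly (a minimal sketch; the path is a placeholder for your scratch space):

import os

# Try to allocate a 1 TiB sparse file; if the filesystem (or a quota)
# refuses files this large, truncate() should raise an OSError.
test_path = "/scratch365/username/sparse_test"
with open(test_path, "wb") as f:
    f.truncate(2**40)  # 1 TiB
print("allocated", os.stat(test_path).st_size, "bytes (apparent)")
os.remove(test_path)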
I'll probably close this issue and open up another that has a more meaningful name, if that's OK.