drep
Advice on handling memory errors
Hi @MrOlm
I ran into a memory error while attempting to dereplicate a dataset of ~300k prokaryotic genomes.
Do you have any advice on how to best get around this? I was thinking about splitting the database by family/genus (this info is already available) and dereplicating each individually before recombining?
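Roughly what I have in mind, assuming a two-column genome-path/genus table (taxonomy.tsv is just a placeholder name, and genus labels are assumed to be simple strings with no spaces or slashes), is something like:
# build one genome list per genus from a tab-separated genome<TAB>genus table
while IFS=$'\t' read -r genome genus; do
    echo "$genome" >> "genomes_${genus}.txt"
done < taxonomy.tsv

# dereplicate each genus on its own, with the same settings as the full run
# (--multiround_primary_clustering is left out here since each per-genus set is much smaller;
#  genomequality.csv is reused as-is, or could be subset per genus if dRep objects to extra rows)
for list in genomes_*.txt; do
    genus=${list#genomes_}
    genus=${genus%.txt}
    dRep dereplicate "Dereplicated_99_${genus}/" -g "$list" \
        --genomeInfo genomequality.csv -p 50 \
        --S_algorithm fastANI --S_ani 0.99
done
# the winners from each run end up in Dereplicated_99_<genus>/dereplicated_genomes/
# and can be pooled afterwards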
The command I used and error I generated are below for info.
Thanks :) Calum
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 --S_algorithm fastANI --S_ani 0.99 --multiround_primary_clustering
Traceback (most recent call last):
File "/home/cwwalsh/miniconda3/envs/drep/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 100, in parseArguments
self.dereplicate_operation(**vars(args))
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 48, in dereplicate_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 120, in all_vs_all_MASH
return run_second_round_clustering(Bdb, genome_chunks, data_folder, verbose=True, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 244, in run_second_round_clustering
Mdb = pd.concat(mdbs).reset_index(drop=True)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/frame.py", line 5793, in reset_index
new_obj = self.copy()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/generic.py", line 6032, in copy
data = self._mgr.copy(deep=deep)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 615, in copy
res._consolidate_inplace()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1685, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 2084, in _consolidate
merged_blocks = _merge_blocks(
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 2111, in _merge_blocks
new_values = np.vstack([b.values for b in blocks]) # type: ignore[misc]
File "<__array_function__ internals>", line 180, in vstack
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/numpy/core/shape_base.py", line 282, in vstack
return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 51.5 GiB for an array with shape (3, 2306205540) and data type object
Hi Calum,
Yes, the clustering step takes the most RAM, and that is where you are running out.
Other than first splitting by genus/family (which is a good idea and would probably solve this problem!), you could try decreasing the --primary_chunksize parameter to 1000 or so.
Also, just as a heads up, you may run into RAM problems during secondary clustering as well. It all depends on how many similar genomes you have, which is database-specific.
I would also recommend adding the flag --run_tertiary_clustering when using --multiround_primary_clustering.
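Taking your original command as the base, that would look something like this (adjust the chunk size to whatever your RAM allows):
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 \
    --S_algorithm fastANI --S_ani 0.99 \
    --multiround_primary_clustering --primary_chunksize 1000 \
    --run_tertiary_clustering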
Best, Matt
Thanks very much, Matt. I'll try the taxonomic splitting first and then reducing the primary chunk size if that fails. I wasn't aware of the tertiary clustering option (my attention span is probably shorter than the help page). I'll give that a try too. Thanks again, Calum
@MrOlm sorry to resurrect this closed issue.
I want to run tertiary clustering on a dataset that has already been dereplicated using multiround primary clustering. Can I just rerun dRep using the same command but with --run_tertiary_clustering added? (I have seen in other scenarios that you have written dRep in such a way that it will resume from a previous run.)
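In other words, something like this (using my original command just as an example), pointing at the existing work directory:
# same work directory and inputs as the original run, with the one extra flag added
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 \
    --S_algorithm fastANI --S_ani 0.99 --multiround_primary_clustering \
    --run_tertiary_clustering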
Thanks, Calum
Hi @cazzlewazzle89 - I expect that this will work, but I'm not 100% positive. I have tried to write dRep to pick up where it left off, but it doesn't always work. I would just give it a shot; if it doesn't crash I would trust the results.