drep
Advice on handling memory errors
Hi @MrOlm
I ran into a memory error while attempting to dereplicate a dataset of ~300k prokaryotic genomes.
Do you have any advice on how to best get around this? I was thinking about splitting the database by family/genus (this info is already available) and dereplicating each individually before recombining?
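Roughly what I have in mind, assuming a two-column genome-path/genus table (taxonomy.tsv is just a placeholder name, and genus labels are assumed to be simple strings with no spaces or slashes), is something like:
# build one genome list per genus from a tab-separated genome<TAB>genus table
while IFS=$'\t' read -r genome genus; do
    echo "$genome" >> "genomes_${genus}.txt"
done < taxonomy.tsv

# dereplicate each genus on its own, with the same settings as the full run
# (--multiround_primary_clustering is left out here since each per-genus set is much smaller;
#  genomequality.csv is reused as-is, or could be subset per genus if dRep objects to extra rows)
for list in genomes_*.txt; do
    genus=${list#genomes_}
    genus=${genus%.txt}
    dRep dereplicate "Dereplicated_99_${genus}/" -g "$list" \
        --genomeInfo genomequality.csv -p 50 \
        --S_algorithm fastANI --S_ani 0.99
done
# the winners from each run end up in Dereplicated_99_<genus>/dereplicated_genomes/
# and can be pooled afterwards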
The command I used and error I generated are below for info.
Thanks :) Calum
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 --S_algorithm fastANI --S_ani 0.99 --multiround_primary_clustering
Traceback (most recent call last):
File "/home/cwwalsh/miniconda3/envs/drep/bin/dRep", line 32, in <module>
Controller().parseArguments(args)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 100, in parseArguments
self.dereplicate_operation(**vars(args))
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/controller.py", line 48, in dereplicate_operation
drep.d_workflows.dereplicate_wrapper(kwargs['work_directory'],**kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_workflows.py", line 37, in dereplicate_wrapper
drep.d_cluster.controller.d_cluster_wrapper(wd, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 179, in d_cluster_wrapper
GenomeClusterController(workDirectory, **kwargs).main()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 32, in main
self.run_primary_clustering()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/controller.py", line 100, in run_primary_clustering
Mdb, Cdb, cluster_ret = drep.d_cluster.compare_utils.all_vs_all_MASH(self.Bdb, self.wd.get_dir('MASH'), **self.kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 120, in all_vs_all_MASH
return run_second_round_clustering(Bdb, genome_chunks, data_folder, verbose=True, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 244, in run_second_round_clustering
Mdb = pd.concat(mdbs).reset_index(drop=True)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/frame.py", line 5793, in reset_index
new_obj = self.copy()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/generic.py", line 6032, in copy
data = self._mgr.copy(deep=deep)
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 615, in copy
res._consolidate_inplace()
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1685, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 2084, in _consolidate
merged_blocks = _merge_blocks(
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 2111, in _merge_blocks
new_values = np.vstack([b.values for b in blocks]) # type: ignore[misc]
File "<__array_function__ internals>", line 180, in vstack
File "/home/cwwalsh/miniconda3/envs/drep/lib/python3.9/site-packages/numpy/core/shape_base.py", line 282, in vstack
return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 51.5 GiB for an array with shape (3, 2306205540) and data type object
Hi Calum,
Yes, the clustering step takes the most RAM, and that is where you are running out.
Other than first splitting by genus/family (which is a good idea and would probably solve this problem!), you could try decreasing the --primary_chunksize parameter to 1000 or so.
Also, just as a heads up, you may run into RAM problems during secondary clustering as well. It all depends on how many similar genomes you have, which is database-specific.
I would also recommend adding the flag --run_tertiary_clustering when using --multiround_primary_clustering.
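Taking your original command as the base, that would look something like this (adjust the chunk size to whatever your RAM allows):
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 \
    --S_algorithm fastANI --S_ani 0.99 \
    --multiround_primary_clustering --primary_chunksize 1000 \
    --run_tertiary_clustering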
Best, Matt
Thanks very much, Matt. I'll try the taxonomic splitting first and then reducing the primary chunk size if that fails. I wasn't aware of the tertiary clustering option (my attention span is probably shorter than the help page). I'll give that a try too. Thanks again, Calum
@MrOlm sorry to resurrect this closed issue.
I want to run tertiary clustering on a dataset that has already been dereplicated using multiround primary clustering. Can I just rerun dRep using the same command but with --run_tertiary_clustering added? (I have seen in other scenarios that you have written dRep in such a way that it will resume from a previous run.)
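In other words, something like this (using my original command just as an example), pointing at the existing work directory:
# same work directory and inputs as the original run, with the one extra flag added
dRep dereplicate Dereplicated_99/ -g names.txt --genomeInfo genomequality.csv -p 50 \
    --S_algorithm fastANI --S_ani 0.99 --multiround_primary_clustering \
    --run_tertiary_clustering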
Thanks, Calum
Hi @cazzlewazzle89 - I expect that this will work, but I'm not 100% positive. I have tried to write dRep to pick up where it left off, but it doesn't always work. I would just give it a shot; if it doesn't crash I would trust the results.