refine_dastool is stalling in aviary recover
Hi aviary developers,
Thanks for developing this great tool. I'm currently running Aviary v0.11.0, which includes Rosella v0.5.4, and I'm encountering an issue with the refine_dastool step. It appears to stall: it runs on only one thread and keeps going for multiple days with nothing happening on the server, until either I kill it or Slurm kills it.
Here is the log file of refine_dastool:
```
INFO: 00:03:13 - Refining iteration 0
INFO: 00:03:13 - Rosella refine iteration 0
[2025-06-05T22:03:13Z INFO rosella] rosella version 0.5.4
[2025-06-05T22:03:32Z INFO rosella::refine::refinery] Reading TNF table.
[2025-06-05T22:03:33Z INFO rosella::refine::refinery] Beginning refinement of 10 MAGs
```
After this point, the job continues with no further output, occupying only one thread on the server, and is ultimately killed by Slurm after four days because it exceeds the time limit. I've repeated the run multiple times, with the same data and with other data, and consistently encountered the same behavior.
Here is the general aviary out file from this run:
Aviary_bins_MLS_6225532_stderr.txt
Do you have any idea what is going on or how to solve this? If you need more information, please let me know.
Thanks!
Best,
Anna
By the way, the refinement of rosella, semibin and metabat2 finishes without any errors.
And this is the das_tool.log, which looks fine to me as well:
Hey Anna,
I'm guessing one or more of those DAStool bins being sent for refinement is massive. Would you be able to generate some summary statistics on those MAGs and maybe pinpoint which of them is causing the issue? Things to look for would be the number of contigs, genome size, level of contamination, and contig size distribution.
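If it helps, a quick standalone sketch like the one below (plain Python, not part of aviary; the bin directory and `.fa` extension are guesses about your setup, so adjust the glob) should give you contig counts and total sizes for each DAStool bin:

```python
# Rough per-bin summary stats; the path and extension below are assumptions.
from pathlib import Path

def bin_stats(fasta_path: Path):
    """Return (number of contigs, total length in bp) for one FASTA file."""
    n_contigs = 0
    total_bp = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_contigs += 1
            else:
                total_bp += len(line.strip())
    return n_contigs, total_bp

# Adjust this glob to wherever your DAStool bins ended up.
for fasta in sorted(Path("data/das_tool_bins").glob("*.fa")):
    contigs, size = bin_stats(fasta)
    print(f"{fasta.name}\t{contigs} contigs\t{size / 1e6:.1f} Mbp")
```

CheckM will already give you contamination, so that plus the contig counts should be enough to spot the culprit.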
Cheers, Rhys
Hi Rhys,
Thanks for getting back to me so quickly! The explanation makes sense now: each thread handles a single bin, so if one of those bins is particularly large, it can hold up the process and make it look like everything has stalled.
I checked the checkm.out file, and it turns out there are indeed one or two massive bins. One of them is 77 million base pairs with a contamination level of 1115% (not a typo!). The second largest is around 12 million base pairs with 162% contamination.
In your experience, what would you consider the upper limit for genome size before it starts to noticeably slow down the pipeline?
Also, do you have any suggestions for how to handle this situation? I’ve tried manually killing the thread that’s still running, which does allow the process to continue, but I’ve also run into errors later in the pipeline due to missing output files.
Looking forward to your thoughts!
Best,
Anna
Yep, that 77 Mbp bin would definitely be the issue. I don't have hard values for when things start to break, but I'd say anything above roughly 30 Mbp should not be handled with a single thread. It's hard to know, though, as sometimes these larger bins do process quite easily. It depends more on how many contigs are present in the bin than on its size. Do you know how many contigs are in each of these bins?
I think aviary should perform a check before refinement that excludes these excessively fragmented bins from being refined, to stop these semi-deadlock situations. We'll figure something out for this.
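Roughly what I have in mind, as a sketch only (this is not actual aviary code, and the contig-count cut-off is arbitrary):

```python
# Hypothetical pre-refinement guard, not aviary internals: bins that are too
# fragmented are passed through unrefined instead of tying up a thread forever.
MAX_CONTIGS_FOR_REFINEMENT = 10_000  # made-up threshold, would need tuning

def split_bins_for_refinement(bins_with_counts):
    """Partition (bin_name, n_contigs) pairs into bins to refine and bins to skip."""
    to_refine, to_skip = [], []
    for name, n_contigs in bins_with_counts:
        if n_contigs <= MAX_CONTIGS_FOR_REFINEMENT:
            to_refine.append(name)
        else:
            to_skip.append(name)
    return to_refine, to_skip
```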
For now, you could run the pipeline up to the DAStool step using `-w das_tool` and then manually run whatever singlem/coverm tools you want on that set of bins. Not ideal, but you might already have those pre-refinement bins sitting there, ready to use.
Cheers, Rhys
Good suggestion, thanks! The 77-million-base-pair bin contains 40,000 contigs, while the 12-million-base-pair bin has 6,000. Maybe I'll try this approach first: run the pipeline up to the checkm_das_tool rule to obtain the CheckM stats for each bin and determine the size of the MAGs, then remove the MAGs that are too large and continue running the pipeline up to das_tool_refine. After that, I'll reintroduce the larger MAGs and proceed to the end of the pipeline. Although this approach is somewhat cumbersome, I really like the bin_info file that Aviary generates, as it summarizes data from all tools, and it would be great to retain that information for all MAGs in the final output.
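In case it's useful to anyone else, this is roughly how I plan to set the oversized bins aside before the refinement step (the paths, the `.fa` extension, and the CheckM column names are assumptions about my setup and would need checking against the actual checkm.out header):

```python
# Sketch: park bins above a size threshold before das_tool_refine, to be
# reintroduced afterwards. Paths and column names are assumptions.
import csv
import shutil
from pathlib import Path

BIN_DIR = Path("data/das_tool_bins")    # assumed location of the DAStool bins
HOLD_DIR = Path("data/bins_held_back")  # oversized bins wait here until after refinement
MAX_SIZE_BP = 30_000_000                # roughly the ~30 Mbp guideline from above

HOLD_DIR.mkdir(exist_ok=True)
with open("checkm.out") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        if int(row["Genome size (bp)"]) > MAX_SIZE_BP:
            fasta = BIN_DIR / f"{row['Bin Id']}.fa"
            shutil.move(str(fasta), HOLD_DIR / fasta.name)
            print(f"Held back {fasta.name}: {row['Genome size (bp)']} bp")
```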
If this doesn't work, I'll try a less intensive approach.
Thanks!
Best,
Anna