denovo+reference/denovo-reference not supported in v0.9 (yet)

Open leonardslog opened this issue 5 years ago • 25 comments

Hello, it looks like the hybrid assembly approach outlined in the documentation is not supported for the 'ddrad' datatype in v0.9.19 and v0.9.20 (and it has been removed from the params file options). Is this functionality deprecated from the earlier versions, or has it always been unsupported for single-end reads? Any alternative solutions would be much appreciated, thanks!

output (command: ipyrad -p params-test.txt -s 1234 -c 8 -f):

ipyrad [v.0.9.20] Interactive assembly and analysis of RAD-seq data

Parallel connection | LAPTOP-xxxxxxxx: 8 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:00:09 | loading reads
2 fastq files loaded to 2 Samples.

Step 2: Filtering and trimming reads [####################] 100% 0:01:29 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:00:12 | indexing reference

Encountered an Error. Message: datatype + assembly_method combo not currently supported.

Parallel connection closed.

leonardslog avatar Dec 20 '19 06:12 leonardslog

Hello, the v0.7 to v0.9 version upgrade included a major overhaul of the internals of step 3. The hybrid denovo+reference method is supported for all datatypes, but we haven't finished polishing this assembly method for the new version, so it's currently hidden. Shouldn't be too long before it's ready for a test drive, but with the holidays and all it's hard to put a timeline on it. I'll leave this issue open because, yeah, it's a known problem and we should fix it. -isaac

isaacovercast avatar Dec 20 '19 10:12 isaacovercast

Awesome, thanks for the clarification!

leonardslog avatar Dec 20 '19 20:12 leonardslog

Hey all, I was wondering if there are any imminent plans to release a new ipyrad version where the denovo+reference method can be used? I have some analyses I am trying to finalize for which it would be useful, basically trying to decide if I should move forward without this option or if there are plans to re-implement it soon. Thanks!

ajbarley avatar Jun 25 '20 17:06 ajbarley

@ajbarley There are no imminent plans to implement denovo+reference at this point. This is a 'would be nice' feature, which for boring reasons is actually rather tricky, so it stays low on the pile. I will speculate that this will be fixed within one calendar year plus/minus a year ;)

isaacovercast avatar Jun 25 '20 18:06 isaacovercast

Sounds good, thanks for the update @isaacovercast!

ajbarley avatar Jun 25 '20 19:06 ajbarley

Hi @isaacovercast, just wondering whether denovo+reference is supported now, or will be soon? I just tried it with my data and was told it is not supported. It would be great if it were supported soon. Cheers

ChuanLego avatar May 19 '22 00:05 ChuanLego

Hi @ChuanLego, I see that I previously prognosticated 1 year +/- 1 year as the soft 'deadline' for when this feature would be added, and we're approaching that date. At this point the denovo+reference method is still on the low-priority pile, unfortunately. I agree it would be great to have, but the amount of work it would take to reimplement is far higher than the benefit that would be obtained from having it available. It's an edge case that I would love to handle, but I don't see it happening any time soon, sorry to say.

isaacovercast avatar May 19 '22 13:05 isaacovercast

Hi @isaacovercast, just wondering if there is any news regarding this feature? Cheers!

jogijsbers avatar Aug 14 '23 20:08 jogijsbers

Hi @jogijsbers, unfortunately there still has not been any movement on this. The denovo-reference method can be implemented with the reference_as_filter parameter. The denovo+reference method in practice doesn't recover much different data than a standard denovo or reference assembly, so it's still on the low-priority list for me. Let me know if you have any questions about performing an assembly with or without a reference, if this might help you proceed. All the best!

isaacovercast avatar Aug 15 '23 14:08 isaacovercast

Hi @isaacovercast, one issue regarding this topic. I'm running ipyrad 0.9.50 and I've tried to use the reference_as_filter parameter to filter out chloroplast sequences. However, the run stops in step 3 because it seems unable to find the reference file, even though it is located in the main folder (I've also tried adding ./ at the beginning of the path in the params file, with the same result). Cytinus.salida.txt params-Cytinus.txt I wonder whether you can give some advice on what I'm doing wrong. All the best!
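One way to rule out a simple path problem before digging further is to read the value back out of the params file and test it. The sketch below is only an illustration: the helper names `get_param` and `check_reference` are hypothetical, and it assumes the usual params-file layout of `value ## [N] [name]: description`. It resolves relative paths against the current directory, so it should be run from the directory where ipyrad is launched.

```shell
#!/bin/bash
# Hypothetical helpers, not part of ipyrad.

# Print the value of one parameter from an ipyrad params file.
# Assumes each line looks like: "value   ## [N] [name]: description".
get_param() {
    local params_file=$1 name=$2
    # Pick the line whose comment names the parameter, drop everything from
    # "##" onward, then trim leading and trailing whitespace.
    grep -F "[${name}]" "$params_file" | head -n 1 |
        sed -e 's/##.*$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Report whether the reference_as_filter value points at a readable file.
check_reference() {
    local params_file=$1 ref
    ref=$(get_param "$params_file" reference_as_filter)
    if [ -z "$ref" ]; then
        echo "unset"
    elif [ -r "$ref" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Usage (file name taken from the attachment mentioned above):
#   check_reference params-Cytinus.txt
```

If this prints "found" and ipyrad still cannot locate the file, the path itself is probably not the cause.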

phlomitero avatar Aug 16 '23 15:08 phlomitero

@phlomitero v0.9.50 is pretty old, there's a reasonable chance this problem has been fixed already. Can you please update to the most recent version (0.9.93) and try again?

isaacovercast avatar Aug 17 '23 17:08 isaacovercast

Hi @isaacovercast, v.0.9.50 is the one we have installed on the cluster. I'm running a subset of samples on a local computer with version 0.9.92 and it seems to run fine (I'll ask for an update on the cluster). However, I've noticed that step 3 clustering/mapping is far slower than the standard denovo assembly without the reference_as_filter option. Am I right? Thanks a lot for the answer and for maintaining this wonderful software!! All the best!

phlomitero avatar Aug 19 '23 11:08 phlomitero

@phlomitero Wonderful, glad it is working on your local computer and thanks for the positive feedback! Step 3 should be faster with the reference_as_filter option, but there are conditions where it could be somewhat slower. How much slower is 'far slower'? Are you using the same computer and the same number of cores for the w/ vs w/o reference_as_filter runs?

isaacovercast avatar Aug 19 '23 15:08 isaacovercast

@isaacovercast I have been using ipyrad v.0.9.43 with the reference flag (#5 assembly method) and a rather large genome (11 GB), on a single plate of ddrad (single-end) data (96 samples). The issue I am having is that after 3 days on an HPC with the maximum number of nodes that can be requested (26), it reaches 50% and never gets any further, even though there is plenty of walltime left. Do you have any advice on how to get this to actually finish? Any information would be greatly appreciated. Thanks.

perryleewoodjr avatar Oct 30 '23 16:10 perryleewoodjr

@perryleewoodjr What sub-step of step 3 is it reaching 50% on? Can you post the job submission script? Is it 26 'nodes' (using MPI) or 26 cores on 1 node? If it is stuck in indexing the reference sequence, it won't matter how many nodes are used because this part of the process doesn't use MPI. If the genome is huge and the amount of ram allocated is not sufficient then the process will be very slow as it will run out of memory and start paging to disk (which will be painfully slow). I suspect this is what's happening. Can you allocate more RAM? You can also ssh to the compute node running the ipyrad process and look at 'top' and 'free' to see if you can figure out more what's happening.

isaacovercast avatar Oct 30 '23 17:10 isaacovercast

Here is the output:

Step 3: Clustering/Mapping reads within samples [########## ] 50% 3 days, 0:00:18 | indexing reference

Here is the bash script:

#!/bin/bash
#SBATCH --time=72:00:00 # walltime
#SBATCH --nodes=1
#SBATCH --ntasks=26
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10G # memory per CPU
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

module load mpi

PARAMS=$1
STEP=$2
OUT=$3

ipyrad -p $PARAMS -s $STEP -c 26 -f 1> $OUT 2>&1

cd $SLURM_SUBMIT_DIR

exit 0

I will check top and free.

Thank you!

perryleewoodjr avatar Oct 30 '23 17:10 perryleewoodjr

Yeah, 10GB might not be enough for an 11GB genome. One trick that you could try is getting an interactive session on the cluster with a big chunk of memory and then running bwa index on the reference sequence by hand. If ipyrad finds the index files then it will skip this part of the process.
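The manual indexing trick can be sketched roughly as follows. `bwa index` is the real bwa subcommand and writes five files next to the reference (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The helper names and the example path are hypothetical, and the assumption that ipyrad looks for exactly these five files is not confirmed in this thread.

```shell
#!/bin/bash
# Hypothetical helpers, not part of ipyrad or bwa.

# List the bwa index files that are missing (or empty) next to a reference.
# bwa index writes <ref>.amb, <ref>.ann, <ref>.bwt, <ref>.pac and <ref>.sa.
missing_bwa_index() {
    local ref=$1 ext
    for ext in amb ann bwt pac sa; do
        [ -s "${ref}.${ext}" ] || echo "${ref}.${ext}"
    done
}

# Build the index only when at least one of the five files is missing.
index_if_needed() {
    local ref=$1
    if [ -z "$(missing_bwa_index "$ref")" ]; then
        echo "index present, nothing to do"
    else
        bwa index "$ref"
    fi
}

# Usage, from an interactive session with plenty of memory
# (the path is a placeholder):
#   index_if_needed /path/to/reference.fasta
```

The index files should sit beside the same reference path that is given in the params file, so that the later ipyrad run can find them.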

isaacovercast avatar Oct 30 '23 17:10 isaacovercast

Yeah, it seems like it has requested a lot of virtual memory. We have some high-memory nodes that I can try. I will check out bwa index. Please let me know if I am interpreting this correctly: I should index the reference genome by hand (separately) using bwa index, then run ipyrad, and hopefully it will skip the indexing step.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
252059 XXXX 20 0 165984 4516 1920 R 1.6 0.0 0:06.25 top

I really appreciate your help.

perryleewoodjr avatar Oct 30 '23 17:10 perryleewoodjr

"I should index the reference genome by hand (separately) using bwa Index. Then run ipyrad and hopefully it will skip the indexing step." <- Yes, that is correct.

Good luck, let me know how it goes.

isaacovercast avatar Oct 30 '23 18:10 isaacovercast

Some warning or small section about the current inability to use some datatypes with specific assembly methods should be added to the documentation. Is this limited just to the ddrad denovo+reference mode?

TheGreatJack avatar Nov 09 '23 20:11 TheGreatJack

@TheGreatJack Thanks for the suggestion, I updated the docs to specify that we don't support these methods any more and also to add details about the reference_as_filter parameter:

https://ipyrad.readthedocs.io/en/master/6-params.html#assembly-method

isaacovercast avatar Nov 10 '23 10:11 isaacovercast

@isaacovercast Sorry for not answering sooner, but I have a question about the speed of the assembly methods (denovo vs. reference). I have the same dataset running on a 40-core cluster with "denovo" as the assembly option, and as you can see the times are as follows:

ipyrad [v.0.9.92] Interactive assembly and analysis of RAD-seq data

Parallel connection | nodo92: 40 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:06:05 | loading reads
286 fastq files loaded to 143 Samples.

Step 2: Filtering and trimming reads [####################] 100% 0:26:51 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:38:34 | join merged pairs
[####################] 100% 0:17:56 | join unmerged pairs
[####################] 100% 0:34:57 | dereplicating
[################### ] 99% 17 days, 13:15:16 | clustering/mapping
(it is still running....)

The same dataset is also running on my local machine with 20 cores, with the assembly method set to "reference" because I have used a small chloroplast genome (about 150 kb) as a reference. The run is really, really slow in this case:

ipyrad [v.0.9.94] Interactive assembly and analysis of RAD-seq data

Parallel connection | rafa-Precision-3660: 20 cores

Step 1: Loading sorted fastq data to Samples [####################] 100% 0:12:43 | loading reads
286 fastq files loaded to 143 Samples.

Step 2: Filtering and trimming reads [####################] 100% 6:07:05 | processing reads

Step 3: Clustering/Mapping reads within samples [####################] 100% 0:00:01 | indexing reference
[####################] 100% 10:54:27 | join unmerged pairs
[####################] 100% 18:34:56 | dereplicating
[####################] 100% 10:30:44 | splitting dereps

I didn't expect such a difference, even though I'm using half the number of cores, because I thought using a reference speeds up the process. Further, when the assembly method is set to "reference", the size of the temporary files increases greatly (in fact, it consumes all the space in my cluster account, 1 TB, and then stops).

Is there any suggestion you can give me to speed up the analysis? Surely I'm doing something wrong.... Thanks!

phlomitero avatar Mar 07 '24 12:03 phlomitero

Sorry, the bold case was not intentional but due to the sequence of dashes.....

phlomitero avatar Mar 07 '24 12:03 phlomitero

@phlomitero The runtime on your local computer with 20 cores and the reference assembly method is almost certainly because of underallocation of RAM. You will need at least 4GB of free RAM per core (so 80GB of free RAM). With Paired end data it could be more than 4GB. If the cores do not have enough RAM and the data is very large then it will go VERY slowly. Also, step 3 has several substeps which happen before the reference alignment, and these steps happen in both the denovo and reference assembly method, so the step 3 running on your laptop hasn't even reached the point of using the reference yet (in the example run that you sent).
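The 4 GB-per-core rule of thumb above can be turned into a quick check before launching step 3. This is only a sketch: the function names are hypothetical, reading `MemAvailable` from `/proc/meminfo` is Linux-specific, and lowering the core count to fit the available RAM is a suggestion that follows from the rule rather than something stated in this thread.

```shell
#!/bin/bash
# Hypothetical helpers, not part of ipyrad.

# RAM needed for a run in GB, defaulting to 4 GB per core.
required_ram_gb() {
    local cores=$1 gb_per_core=${2:-4}
    echo $(( cores * gb_per_core ))
}

# Available RAM in whole GB, read from a /proc/meminfo-style file (Linux only).
available_ram_gb() {
    local meminfo=${1:-/proc/meminfo}
    # MemAvailable is reported in kB; 1048576 kB = 1 GB.
    awk '/^MemAvailable:/ { print int($2 / 1048576) }' "$meminfo"
}

# Largest core count that fits in the given amount of RAM (never below 1).
max_cores_for_ram() {
    local avail_gb=$1 gb_per_core=${2:-4} cores
    cores=$(( avail_gb / gb_per_core ))
    [ "$cores" -ge 1 ] || cores=1
    echo "$cores"
}

# Usage:
#   required_ram_gb 20                        # RAM needed for 20 cores
#   max_cores_for_ram "$(available_ram_gb)"   # cores this machine can feed
```

For the 20-core run described above, `required_ram_gb 20` gives the same 80 GB figure. With paired-end data a larger per-core value can be passed as the second argument.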

In reference assemblies the size of temporary files is definitely bigger, it's part of the trade-off for speed. There's nothing you can do about the file sizes except try to get a bigger disk allocation.

If you run this data as Single-end and use only R1 then it would make things faster and also the temp files would not be so large.

isaacovercast avatar Mar 07 '24 14:03 isaacovercast

Understood! I will beg for a bigger memory allocation! Thanks a lot!

phlomitero avatar Mar 08 '24 08:03 phlomitero