ipyrad
denovo+reference/denovo-reference not supported in v0.9 (yet)
Hello, it looks like the hybrid assembly approach outlined in the documentation is not supported for the 'ddrad' datatype in v0.9.19 and v0.9.20 (it has been removed from the params file options). Is this functionality deprecated from the earlier versions, or has it always been unsupported for single-end reads? Any alternative solutions would be much appreciated, thanks!
output (command: ipyrad -p params-test.txt -s 1234 -c 8 -f):
ipyrad [v.0.9.20] Interactive assembly and analysis of RAD-seq data
Parallel connection | LAPTOP-xxxxxxxx: 8 cores
Step 1: Loading sorted fastq data to Samples [####################] 100% 0:00:09 | loading reads
2 fastq files loaded to 2 Samples.
Step 2: Filtering and trimming reads [####################] 100% 0:01:29 | processing reads
Step 3: Clustering/Mapping reads within samples [####################] 100% 0:00:12 | indexing reference
Encountered an Error. Message: datatype + assembly_method combo not currently supported.
Parallel connection closed.
Hello, the v0.7 to v0.9 version upgrade included a major overhaul of the internals of step 3. The hybrid denovo+reference method is supported for all datatypes, but we haven't finished polishing this assembly method for the new version, so it's currently hidden. It shouldn't be too long before it's ready for a test drive, but with the holidays and all it's hard to put a timeline on it. I'll leave this issue open because, yeah, it's a known problem and we should fix it. -isaac
Awesome, thanks for the clarification!
Hey all, I was wondering if there are any imminent plans to release a new ipyrad version where the denovo+reference method can be used? I have some analyses I am trying to finalize for which it would be useful, basically trying to decide if I should move forward without this option or if there are plans to re-implement it soon. Thanks!
@ajbarley There are no imminent plans to implement denovo+reference at this point. This is a 'would be nice' feature, which for boring reasons is actually rather tricky, so it stays low on the pile. I will speculate that this will be fixed within one calendar year plus/minus a year ;)
Sounds good, thanks for the update @isaacovercast!
Hi @isaacovercast, just wondering whether denovo+reference is supported now, or will be soon? I just tried it with my data and was told it is not supported. It would be great if it were supported soon. Cheers
Hi @ChuanLego, I see that I previously prognosticated 1 year +/- 1 year as the soft 'deadline' for when this feature would be added, and we're approaching that date. At this point the denovo+reference method is still on the low-priority pile, unfortunately. I agree it would be great to have, but the amount of work it would take to reimplement is far higher than the benefit that would be obtained from having it available. It's an edge case that I would love to handle, but I don't see it happening any time soon, sorry to say.
Hi @isaacovercast, just wondering if there is any news regarding this feature? Cheers!
Hi @jogijsbers, unfortunately there has not been any motion on this still. The denovo-reference method can be implemented with the reference_as_filter parameter. The denovo+reference method in practice doesn't recover much different data than a standard denovo or reference assembly, so it's still something on the low priority list for me. Let me know if you have any questions about performing an assembly with or without reference, if this might help you proceed. All the best!
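Since a reference_as_filter run depends on ipyrad being able to open the file named in the params file, a quick pre-flight check can save a failed step 3. The following is a minimal sketch, not part of ipyrad: the helper names `param_value` and `check_filter_reference` are hypothetical, and it assumes the v0.9 params-file layout in which each line reads `<value>  ## [N] [name]: description`.

```shell
#!/bin/bash
# Hypothetical helpers, not part of ipyrad.

# Print the value stored on the params-file line tagged with [name].
# Assumed line layout: "<value>   ## [N] [name]: description".
param_value() {
    local params_file=$1 name=$2
    awk -v tag="[$name]" '
        index($0, tag) {
            sub(/[ \t]*##.*$/, "")   # drop the trailing comment part of the line
            sub(/^[ \t]+/, "")       # drop leading whitespace
            print
            exit
        }' "$params_file"
}

# Return 0 only if reference_as_filter is set and names an existing file.
# Relative paths are resolved against the current working directory.
check_filter_reference() {
    local ref
    ref=$(param_value "$1" reference_as_filter)
    [ -n "$ref" ] && [ -f "$ref" ]
}
```

For example, `check_filter_reference params-test.txt || echo "filter reference not found"` could be run from the directory where ipyrad will be launched, before starting step 3.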
Hi @isaacovercast, one issue regarding this topic. I'm running ipyrad 0.9.50 and I've tried to run it with the reference_as_filter parameter to filter out chloroplast sequences. However, the run stops in step 3, apparently because it does not find the reference file even though it is located in the main folder (I've also tried adding ./ at the beginning of the path in the params file, with the same result). Cytinus.salida.txt params-Cytinus.txt I wonder whether you can give some advice on what I'm doing wrong. All the best!
@phlomitero v0.9.50 is pretty old, there's a reasonable chance this problem has been fixed already. Can you please update to the most recent version (0.9.93) and try again?
Hi @isaacovercast, v.0.9.50 is the one we have installed on the cluster. I'm running a subset of samples on a local computer with version 0.9.92 and it seems to run fine (I'll ask for an update on the cluster). However, I've noticed that step 3 clustering/mapping is far slower than the standard denovo assembly without the reference_as_filter option. Am I right? Thanks a lot for the answer and for maintaining this wonderful software!! All the best!
@phlomitero Wonderful, glad it is working on your local computer and thanks for the positive feedback! Step 3 should be faster with the reference_as_filter option, but there are conditions where it could be somewhat slower. How much slower is 'far slower'? Are you using the same computer and the same number of cores for the w/ vs w/o reference_as_filter runs?
@isaacovercast I have been using ipyrad v.0.9.43 with the reference flag (#5 assembly method) and a rather large genome (11 GB), on a single plate of ddrad (single-end) data (96 samples). The issue I am having is that after 3 days on an HPC with the maximum number of nodes that can be requested (26), it reaches 50% and never gets any further, even though there is plenty of walltime left. Do you have any advice on how to get this to actually finish? Any information would be greatly appreciated. Thanks.
@perryleewoodjr What sub-step of step 3 is it reaching 50% on? Can you post the job submission script? Is it 26 'nodes' (using MPI) or 26 cores on 1 node? If it is stuck in indexing the reference sequence, it won't matter how many nodes are used because this part of the process doesn't use MPI. If the genome is huge and the amount of ram allocated is not sufficient then the process will be very slow as it will run out of memory and start paging to disk (which will be painfully slow). I suspect this is what's happening. Can you allocate more RAM? You can also ssh to the compute node running the ipyrad process and look at 'top' and 'free' to see if you can figure out more what's happening.
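As a complement to reading `top` and `free` by eye, the relevant numbers can be pulled straight from `/proc/meminfo` on the compute node. This is a sketch under assumptions: it only works on Linux, the helper name `mem_summary` is made up for illustration, and it relies on the `MemAvailable`, `SwapTotal` and `SwapFree` fields, which report values in kB.

```shell
#!/bin/bash
# Hypothetical helper: summarise memory from a /proc/meminfo-style file.
# Values in the file are in kB; the output is truncated to whole GB.
mem_summary() {
    local meminfo=${1:-/proc/meminfo}
    awk '
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { swap_total = $2 }
        $1 == "SwapFree:"     { swap_free = $2 }
        END {
            printf "available_gb=%d swap_used_gb=%d\n",
                   avail / 1048576, (swap_total - swap_free) / 1048576
        }' "$meminfo"
}
```

Running `mem_summary` a few times while the indexing step is active gives a rough picture: available memory near zero together with a growing swap figure would be consistent with the paging scenario described above.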
Here is the output:
Step 3: Clustering/Mapping reads within samples [########## ] 50% 3 days, 0:00:18 | indexing reference
Here is the bash script:
#!/bin/bash
#SBATCH --time=72:00:00 # walltime
#SBATCH --nodes=1
#SBATCH --ntasks=26
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10G # memory per CPU
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

module load mpi

PARAMS=$1
STEP=$2
OUT=$3

ipyrad -p $PARAMS -s $STEP -c 26 -f 1> $OUT 2>&1
cd $SLURM_SUBMIT_DIR
exit 0
I will check top and free.
Thank you!
Yeah, 10GB might not be enough for an 11GB genome. One trick that you could try is getting an interactive session on the cluster with a big chunk of memory and then running bwa index on the reference sequence by hand. If ipyrad finds the index files then it will skip this part of the process.
Yeah, it seems like it has a high amount of virtual memory requested. We have some high-memory nodes that I can try. I will check out bwa index. Please let me know if I am interpreting this correctly: I should index the reference genome by hand (separately) using bwa index, then run ipyrad, and hopefully it will skip the indexing step.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
252059 XXXX 20 0 165984 4516 1920 R 1.6 0.0 0:06.25 top
I really appreciate your help.
"I should index the reference genome by hand (separately) using bwa index, then run ipyrad, and hopefully it will skip the indexing step." <- Yes, that is correct.
Good luck, let me know how it goes.
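The manual indexing workaround discussed above can be sketched as follows. `bwa index` writes five files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); the assumption here is that the presence of those files is what lets ipyrad skip its own indexing step. The function names are hypothetical, and the reference path in the usage note is a placeholder.

```shell
#!/bin/bash
# Hypothetical helpers for indexing a reference by hand.

# Return 0 if all five BWA index files sit next to the reference FASTA.
bwa_index_present() {
    local ref=$1 ext
    for ext in amb ann bwt pac sa; do
        [ -e "$ref.$ext" ] || return 1
    done
}

# Build the index only when it is missing.
index_reference() {
    local ref=$1
    if bwa_index_present "$ref"; then
        echo "index found for $ref, nothing to do"
    else
        bwa index "$ref"
    fi
}
```

In an interactive session on a high-memory node, `index_reference /path/to/reference.fa` would be run once. The subsequent ipyrad job should then point at the same reference path, so that the index files are found alongside it.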
Some warning or small section about the current inability to use some datatypes with specific assembly methods should be added to the documentation. Is this limited just to the ddrad denovo+reference mode?
@TheGreatJack Thanks for the suggestion, I updated the docs to specify that we don't support these methods anymore and also to add details about the reference_as_filter parameter:
https://ipyrad.readthedocs.io/en/master/6-params.html#assembly-method
@isaacovercast Sorry for not answering sooner, but I have a question about the speed of the assembly methods (denovo vs. reference). I have the same dataset running on a 40-core cluster with "denovo" as the assembly option, and as you can see the times are as follows:
ipyrad [v.0.9.92] Interactive assembly and analysis of RAD-seq data
Parallel connection | nodo92: 40 cores
Step 1: Loading sorted fastq data to Samples
[####################] 100% 0:06:05 | loading reads
286 fastq files loaded to 143 Samples.
Step 2: Filtering and trimming reads [####################] 100% 0:26:51 | processing reads
Step 3: Clustering/Mapping reads within samples
[####################] 100% 0:38:34 | join merged pairs
[####################] 100% 0:17:56 | join unmerged pairs
[####################] 100% 0:34:57 | dereplicating
[################### ] 99% 17 days, 13:15:16 | clustering/mapping
it is still running....
And the same dataset is running on my local machine with 20 cores and the assembly method is set to "reference" because I have used a small chloroplast genome (about 150 Kb) as a reference. And the times are really, really slow in the latter case:
ipyrad [v.0.9.94] Interactive assembly and analysis of RAD-seq data
Parallel connection | rafa-Precision-3660: 20 cores
Step 1: Loading sorted fastq data to Samples
[####################] 100% 0:12:43 | loading reads
286 fastq files loaded to 143 Samples.
Step 2: Filtering and trimming reads [####################] 100% 6:07:05 | processing reads
Step 3: Clustering/Mapping reads within samples
[####################] 100% 0:00:01 | indexing reference
[####################] 100% 10:54:27 | join unmerged pairs
[####################] 100% 18:34:56 | dereplicating
[####################] 100% 10:30:44 | splitting dereps
I didn't expect such a difference, even though I'm using half the number of cores, because I thought using a reference speeds up the process. Further, when the assembly method is set to "reference" the size of the temporary files increases greatly (in fact, it consumes all the space in my cluster account, 1 TB, and then stops).
Is there any suggestion you can give me to speed up the analysis? Surely I'm doing something wrong.... Thanks!
Sorry, the bold case was not intentional but due to the sequence of dashes.....
@phlomitero The runtime on your local computer with 20 cores and the reference assembly method is almost certainly because of underallocation of RAM. You will need at least 4GB of free RAM per core (so 80GB of free RAM). With paired-end data it could be more than 4GB. If the cores do not have enough RAM and the data is very large then it will go VERY slowly. Also, step 3 has several substeps which happen before the reference alignment, and these substeps happen in both the denovo and reference assembly methods, so the step 3 running on your local machine hasn't even reached the point of using the reference yet (in the example run that you sent).
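The arithmetic behind that rule of thumb is simple enough to script. The sketch below uses the 4 GB per core figure from this thread as its default; the function names are hypothetical, and paired-end data may need a larger per-core value.

```shell
#!/bin/bash
# Hypothetical helpers for the "4 GB of free RAM per core" rule of thumb.

# Free RAM (GB) needed to run with a given number of cores.
required_ram_gb() {
    local cores=$1 gb_per_core=${2:-4}
    echo $(( cores * gb_per_core ))
}

# Largest core count that a given amount of free RAM (GB) can support.
max_cores_for_ram() {
    local free_gb=$1 gb_per_core=${2:-4}
    echo $(( free_gb / gb_per_core ))
}
```

`required_ram_gb 20` prints 80, matching the figure above. If a machine had, say, 32 GB free (a made-up number), `max_cores_for_ram 32` prints 8, which could then be passed to ipyrad with `-c 8` instead of `-c 20`.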
In reference assemblies the size of temporary files is definitely bigger, it's part of the trade-off for speed. There's nothing you can do about the file sizes except try to get a bigger disk allocation.
If you run this data as Single-end and use only R1 then it would make things faster and also the temp files would not be so large.
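One way to set up an R1-only run without touching the original files is to symlink just the R1 reads into a separate directory. This is a sketch under assumptions: it relies on the `_R1_` / `_R2_` file-naming convention that ipyrad uses for paired data, the function name `link_r1_only` is hypothetical, and whether symlinked input suits a particular setup is untested here.

```shell
#!/bin/bash
# Hypothetical helper: symlink only the R1 files from a paired-end
# directory into a new directory, leaving the originals untouched.
link_r1_only() {
    local src=$1 dest=$2 f
    mkdir -p "$dest"
    src=$(cd "$src" && pwd)          # absolute path so the links stay valid
    for f in "$src"/*_R1_*; do
        [ -e "$f" ] || continue      # glob matched nothing
        ln -sf "$f" "$dest/"
    done
}
```

After something like `link_r1_only ./fastqs ./fastqs_R1` (both directory names are placeholders), a new assembly would presumably point sorted_fastq_path at the R1-only directory and switch the datatype from the paired form to its single-end counterpart (for example pairddrad to ddrad).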
Understood! I will beg for a bigger memory allocation! Thanks a lot!