
Different output multiple samples (with for loop) and single sample

[Open] LunavdL opened this issue 2 years ago · 5 comments

Hi,

I have been using NGSpeciesID on a microbiome dataset with 90 samples (one fastq file per sample). I use a for loop to iterate over these samples with the following command:

```shell
for file in *.fastq; do
  bn=$(basename "$file" .fastq)
  NGSpeciesID --ont --consensus --abundance_ratio 0.002 --racon --racon_iter 2 --fastq "$file" --outfolder "$bn"
done
```

While most samples have well over 20 different consensus sequences, I noticed that a few samples had zero or very few. For these samples, I then ran NGSpeciesID individually, using for example:

`NGSpeciesID --ont --consensus --abundance_ratio 0.002 --racon --racon_iter 2 --fastq barcode01.fastq --outfolder barcode01.fastq`

When I did that, these samples suddenly also had >20 final consensus sequences.

I then ran the for loop again on the same dataset, and out of 90 samples, 5 had a very different number of consensus sequences when comparing the first for-loop outcome to the second. Sample 22, for example, had 29 consensus sequences in the first run but only 3 in the second (consensus_reference_1.fasta, consensus_reference_32.fasta, and consensus_reference_165.fasta). The number of reads assigned to each of the 3 consensus sequences in the second run is exactly the same as for consensus IDs 1, 32, and 165, respectively, in the first run. So it appears that a number of reads have simply been skipped.

Do you have any clue why this is happening?

Best, Luna

LunavdL avatar Jul 12 '21 07:07 LunavdL

Hi @LunavdL,

The algorithm should be deterministic. If some randomness is being introduced in the code, I should fix it. However, the variability in output may be caused by something else. The best way for me to figure it out is to try it myself on one of the datasets where you observe this variability. Let me know if that is possible.

What you can try is to add `rm -r racon_dir` before the NGSpeciesID command, where `racon_dir` is the path to the folder with the racon output. This will make sure racon does not reuse any old files from previous runs. I assume racon does not do this, but it could be worth checking.
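Applied to the original loop, that suggestion could look like the sketch below. Note that `racon_dir` is a placeholder for wherever racon writes its intermediate files on your system (it is not a documented NGSpeciesID path), so adjust it before use:

```shell
#!/usr/bin/env bash
# Sketch: per-sample loop with the stale-racon-output cleanup applied.
# "racon_dir" is a placeholder path, not an NGSpeciesID default.
set -euo pipefail
shopt -s nullglob   # make the loop a no-op when no *.fastq files match

for file in *.fastq; do
  bn=$(basename "$file" .fastq)
  rm -rf racon_dir   # discard racon leftovers from earlier runs
  NGSpeciesID --ont --consensus --abundance_ratio 0.002 \
    --racon --racon_iter 2 --fastq "$file" --outfolder "$bn"
done
```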

Best, Kristoffer

ksahlin avatar Jul 12 '21 11:07 ksahlin

Hi @ksahlin,

Thank you for your fast reply!

Yes, I can send a dataset I have been working with. What would be the easiest way to send it?

I ran the loop on 10 samples again yesterday, this time writing the terminal output to a text file. The second sample did not produce any consensus sequences (it had 38 in both previous runs). The output seems normal until the creation of the clusters:

```
STARTING TO CREATE CLUSTER CONSENSUS

Temporary workdirektory for consensus and polishing: /tmp/tmp3f5kn3j2
creating center of 8634 sequences.
creating center of 8362 sequences.
creating center of 7532 sequences.
creating center of 7448 sequences.
creating center of 7275 sequences.
creating center of 6936 sequences.
creating center of 6313 sequences.
creating center of 4202 sequences.
creating center of 3898 sequences.
creating center of 2167 sequences.
creating center of 2016 sequences.
```

After that, it stops and continues with the next sample.

Best, Luna

LunavdL avatar Jul 13 '21 06:07 LunavdL

It depends on how large the dataset is. From the output, it looks like it is too large to send gzipped over email. Perhaps you could upload the data to Google Drive and send a link to one of my email addresses, or use Dropbox. Otherwise, I found this site: https://transfer.pcloud.com/ . I have not tried it, though.

It would be great if I could get a minimal failing instance.

ksahlin avatar Jul 14 '21 07:07 ksahlin

I uploaded 25 fastq files through , so you have probably received an email about that by now. I ran the pipeline several times on the same dataset and the problem occurred at different samples every time, so I could not pinpoint it to a specific file. This makes it very hard to reproduce. Every time I ran the loop, I had at least one 'no-consensus-sequences' event in the first 10-20 samples, so I hope the 25 files I've sent will be enough.

Thank you again!

LunavdL avatar Jul 14 '21 12:07 LunavdL

I sent an email to your uni email address about this.

ksahlin avatar Jul 14 '21 12:07 ksahlin