gossamer
gossamer copied to clipboard
most input reads mysteriously discarded from xenome output
I am trying to use Xenome to filter xenograft mouse reads from graft human reads. I am running into the issue that xenome classify finishes in seconds and returns kilobyte size output files without any discernible error message. Most the the reads seem to have been discarded and I can't figure out where they have gone or why this may have happened.
Is there any reason you can think of why this may have happened? I am using xenome 1.0.0.
This is the command I ran:
xenome classify --tmp-dir /fastscratch/XXX -v -P index --pairs -i /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R1_001.fastq.gz -i /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R2_001.fastq.gz --output-filename-prefix /XXX/86718-3 --graft-name human --host-name mouse --log-file logfile > statsfile
This is the output to the log file:
Tue Oct 10 11:05:14 2017 info opening buffer 0 /fastscratch/XXX/1507647914-50853-0-classbuf-0
Tue Oct 10 11:05:14 2017 info performing 1 pass
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_neither_1.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_neither_2.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_both_1.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_both_2.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_ambiguous_1.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_ambiguous_2.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_human_1.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_human_2.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_mouse_1.fastq
Tue Oct 10 11:05:14 2017 info writing to /XXXX/86718-3/86718-3_mouse_2.fastq
Tue Oct 10 11:05:14 2017 info pass 0
Tue Oct 10 11:05:18 2017 info parsing sequences from /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R1_001.fastq.gz
Tue Oct 10 11:05:18 2017 info parsing sequences from /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R2_001.fastq.gz
Tue Oct 10 11:09:18 2017 info total elapsed time: 244.56612205505371
Tue Oct 10 11:09:18 2017 info total elapsed time: 244.57243013381958
This is the stdout:
Statistics
B G H M count percent class
0 0 0 0 250 85.034 "neither"
0 0 0 1 0 0 "both"
0 0 1 0 22 7.48299 "definitely mouse"
0 0 1 1 1 0.340136 "probably mouse"
0 1 0 0 15 5.10204 "definitely human"
0 1 0 1 2 0.680272 "probably human"
0 1 1 0 0 0 "ambiguous"
0 1 1 1 0 0 "ambiguous"
1 0 0 0 0 0 "both"
1 0 0 1 0 0 "probably both"
1 0 1 0 0 0 "definitely mouse"
1 0 1 1 1 0.340136 "probably mouse"
1 1 0 0 0 0 "definitely human"
1 1 0 1 1 0.340136 "probably human"
1 1 1 0 0 0 "ambiguous"
1 1 1 1 2 0.680272 "ambiguous"
Summary
count percent class
18 6.12245 human
24 8.16327 mouse
0 0 both
250 85.034 neither
2 0.680272 ambiguous
The input fastq files are large:
total 6.6G
-rwxr-xr-- 1 XXX XXX 2.8G Oct 6 13:04 17-86718-3_S7_L002_R1_001.fastq.gz
-rwxr-xr-- 1 XXX XXX 3.1G Oct 6 13:03 17-86718-3_S7_L002_R2_001.fastq.gz
The output files are tiny:
-rw-r--r-- 1 XXX XXX 445 Oct 10 11:05 86718-3_ambiguous_1.fastq
-rw-r--r-- 1 XXX XXX 445 Oct 10 11:05 86718-3_ambiguous_2.fastq
-rw-r--r-- 1 XXX XXX 0 Oct 10 11:05 86718-3_both_1.fastq
-rw-r--r-- 1 XXX XXX 0 Oct 10 11:05 86718-3_both_2.fastq
-rw-r--r-- 1 XXX XXX 4.0K Oct 10 11:05 86718-3_human_1.fastq
-rw-r--r-- 1 XXX XXX 4.0K Oct 10 11:05 86718-3_human_2.fastq
-rw-r--r-- 1 XXX XXX 1.9K Oct 10 11:09 86718-3.log
-rw-r--r-- 1 XXX XXX 5.3K Oct 10 11:05 86718-3_mouse_1.fastq
-rw-r--r-- 1 XXX XXX 5.3K Oct 10 11:05 86718-3_mouse_2.fastq
-rw-r--r-- 1 XXX XXX 55K Oct 10 11:09 86718-3_neither_1.fastq
-rw-r--r-- 1 XXX XXX 55K Oct 10 11:09 86718-3_neither_2.fastq
-rw-r--r-- 1 XXX XXX 631 Oct 10 11:09 86718-3.stats.txt
The index took surprising little time to create, and also seems a bit small perhaps, but I don't know what it should be....
-rwxrwx--- 1 XXX XXX 24 Oct 9 14:25 index-both.header
-rwxrwx--- 1 XXX XXX 21M Oct 9 14:25 index-both.kmers-d0
-rwxrwx--- 1 XXX XXX 6.4M Oct 9 14:25 index-both.kmers-d1
-rwxrwx--- 1 XXX XXX 56 Oct 9 14:25 index-both.kmers.header
-rwxrwx--- 1 XXX XXX 32M Oct 9 14:25 index-both.kmers.high-bits
-rwxrwx--- 1 XXX XXX 371M Oct 9 14:25 index-both.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX 186M Oct 9 14:25 index-both.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX 24M Oct 9 14:29 index-both.lhs-bits
-rwxrwx--- 1 XXX XXX 24M Oct 9 14:29 index-both.rhs-bits
-rwxrwx--- 1 XXX XXX 24 Oct 9 14:23 index-graft.header
-rwxrwx--- 1 XXX XXX 3.1M Oct 9 14:23 index-graft.kmers-d0
-rwxrwx--- 1 XXX XXX 4.5M Oct 9 14:23 index-graft.kmers-d1
-rwxrwx--- 1 XXX XXX 56 Oct 9 14:23 index-graft.kmers.header
-rwxrwx--- 1 XXX XXX 20M Oct 9 14:23 index-graft.kmers.high-bits
-rwxrwx--- 1 XXX XXX 187M Oct 9 14:23 index-graft.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX 94M Oct 9 14:23 index-graft.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX 24 Oct 9 14:24 index-host.header
-rwxrwx--- 1 XXX XXX 2.4M Oct 9 14:24 index-host.kmers-d0
-rwxrwx--- 1 XXX XXX 4.9M Oct 9 14:24 index-host.kmers-d1
-rwxrwx--- 1 XXX XXX 56 Oct 9 14:24 index-host.kmers.header
-rwxrwx--- 1 XXX XXX 20M Oct 9 14:24 index-host.kmers.high-bits
-rwxrwx--- 1 XXX XXX 184M Oct 9 14:24 index-host.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX 92M Oct 9 14:24 index-host.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX 6.9K Oct 9 14:29 xenome-index_stdout.txt
This is the command I used to create the index:
xenome index -M 24 -T 8 -P index -H Mus_musculus.GRCm38.dna_rm.primary_assembly.fa.gz -G Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz
I was able to track down a copy of xenome 1.0.1 via a co-worker and am having the same issue with that version.
Thank you! That helps a lot.
I've been looking for an incompatibility, but it sounds like it's actually a longstanding bug.
Everyone has probably worked out by now that you can make the bug go away, at the cost of it taking longer, by running the "index" command single-threaded:
xenome index -T 1
So, that did not seem to fix the problem.
I tried running xenome index -T 1
and xenome index -T 1 -M 24
as per your suggestion and it does not seem to do anything.
The most interesting part is that the file size of the output index files does not change after the program starts outputting lines of text, ie.
XX XX XX 0.0
...
...
138698 158334981 194003348 81.6146
139591 159383557 194003348 82.1551
140479 160432135 194003348 82.6955
141363 161480711 194003348 83.236
142378 162529283 194003348 83.7765
143564 163577862 194003348 84.317
144345 164626437 194003348 84.8575
145444 165675015 194003348 85.398
146574 166723590 194003348 85.9385
147543 167772165 194003348 86.479
148680 168820741 194003348 87.0195
149831 169869318 194003348 87.56
150791 170917894 194003348 88.1005
151746 171966469 194003348 88.641
152629 173015046 194003348 89.1815
153548 174063622 194003348 89.722
154478 175112205 194003348 90.2625
155617 176160772 194003348 90.803
156774 177209352 194003348 91.3435
157654 178257925 194003348 91.8839
158474 179306501 194003348 92.4244
159568 180355077 194003348 92.9649
160592 181403651 194003348 93.5054
161414 182452229 194003348 94.0459
162274 183500805 194003348 94.5864
163365 184549382 194003348 95.1269
164541 185597957 194003348 95.6674
165461 186646534 194003348 96.2079
166521 187695115 194003348 96.7484
167539 188743688 194003348 97.2889
168645 189792261 194003348 97.8294
169635 190840838 194003348 98.3699
170640 191889419 194003348 98.9104
171758 192937996 194003348 99.4509
172809 193986571 194003348 99.9914
172823 194003348 194003348 100
So the file size of the output files did not change between 0 and 100 being printed to stdout, even tho at least 40 minutes to an hour passed.
EDIT: I am now trying to run xenome index
using different reference genomes, previously I had been using Ensembl "dna_rm" genomes because it seems appropriate to drop repeat regions for my use but I am now trying with "dna_sm" which does not drop these and instead marks them using small case. See: ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/README
UPDATE: Using the "dna_sm" references from Ensembl the size of the index files is different, but the output index files also do not change during/after xenome index
starts outputting text to stdout.
Subsequently running xenome classify
also results in truncated fastq files as in the initial post, without returning any useful error messages. I'm at a loss here what is happening. Could this be related to the reference genome? Is xenome index
compatible with Ensemble reference genomes?
The problem is a concurrency issue. And it's a nasty one. This is why it works on short reference genomes: there isn't enough work for multithreading to be an issue.
I have a co-worker who has said he has run xenome without issue using human/mouse reference genomes (not sure exactly which ones) in the past (two years or so ago).
I've now tried running xenome index
with -T 1
and -T 8
, could it potentially work with other values of -T
?
Are you certain that this is a problem that is due to xenome index
and not xenome classify
? It seems so to me because there does not seem to be any change in the output files from xenome index
after the first 5 minutes of it running, even though it goes on for another hour or so.
Do you think it could work if I tried running this on a different server? Our cluster is running Linux XXX 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
on centos-release-6-7.el6.centos.12.3.x86_64
Is there anything else I could do that may resolve this? Is this a bug with the program or could recompiling it fix the issue? If the former, is there any outlook for a fix?
Apologies for the many questions, thanks for your help!
"I've now tried running xenome index with -T 1 and -T 8, could it potentially work with other values of -T?"
If the bug is what we think it is, higher values should make it go faster and also increase the probability of xenome hanging.
"Are you certain that this is a problem that is due to xenome index and not xenome classify?"
Now that you mention it, no, I'm not certain. The only reason I said this is that nobody had reported an issue with classify yet, and we hadn't observed an issue with classify yet.
"Do you think it could work if I tried running this on a different server?"
It probably won't make a difference.
"Is this a bug with the program or could recompiling it fix the issue? If the former, is there any outlook for a fix?"
It's a bug. We are still working on it when we can. And, of course, we would eagerly accept patches!
Weird. A co-worker tipped me to uncompress the fastq (gunzip file.fq
) and try running xenome classify
again... To my surprise xenome now seems to work without issues. Hope this helps pinpoint the problem.
Same issue with xenome index
, decompressing FASTA files prior to running seems to allow the program to work as intended. It seems to me that xenome has issues dealing with gzipped fasta/fastq files. Perhaps this is an easier fix than finding some nasty concurrency issues of unknown origin?