khmer icon indicating copy to clipboard operation
khmer copied to clipboard

Is sample-reads-randomly pair aware?

Open brooksph opened this issue 8 years ago • 4 comments

I'm using sample-reads-randomly.py with interleaved reads like so:

sample-reads-randomly.py SRR606249.pe.fq.gz

Once complete

running head SRR606249.pe.fq.gz.subset yields

@SRR606249.9162938.2 HWI-ST829:109:D073EACXX:2:1108:3193:135131 length=101
AAGGGGAGCCGGAGAGATACTTCTCACCAGCATCGACAGAGACGGTACGAAAATGGGATACGACACGGAGATGATCGAATTCGTAAGACCGCTGACGAACC
+
CCCFFFFFHHHHHJIJJJJJJJJJJJJJJJJJIJJIIJJJJJJJJGHHHH;?BACEEEDDDDDDDDDDDDBDDDEDDDDDDDDDDDDDDDDDDDDDD?BDB
@SRR606249.352354.2 HWI-ST829:109:D073EACXX:2:1101:16862:78751 length=101
AGCCGCAAGAGTTTCTGGTATAGGACACTCTTCTCCTCTAACATCTAAGACTTCTGTGGGCTTTTCGTTTTTCAATTTCTCTTCAATACTCATCATAATTT
+
CCCFFFFFHHHHHJJJJJHIJJJJHIIJJJJJIIIJJJJJJJIJJIGIIHEFIIJIIIJJJJJJJ;BEHHHFFFFFFEEEEEEEEDEEDDDDDDDDDDDEE
@SRR606249.47451110.2 HWI-ST829:109:D073EACXX:2:2303:11784:12273 length=101
CTACATATCTATCGTATAAAAGCTGTGTTTGTAAATAATACCCTTGGGCTTCTGGTTGAGCTACATCACCGTTATTGTCTTCTGCATATCCTAAATCCTTT

Then when I search for the r1 pair to "@SRR606249.9162938.2" using grep "9162938.1" SRR606249.pe.fq.gz.subset nothing is found. Yet this read can be found in the original file using grep "9162938.1" SRR606249.pe.fq which yields @SRR606249.9162938.1 HWI-ST829:109:D073EACXX:2:1108:3193:135131 length=101.

brooksph avatar Jul 28 '17 05:07 brooksph

I'm guessing this is a problem with the format of the headers, which are probably not being recognized as paired-end - the '.2' and '.1' are nonstandard. I'll have to look into it some more to be sure.

ctb avatar Jul 28 '17 14:07 ctb

Hi @brooksph! How was this reads file created? Was it downloaded or converted using the SRA toolkit?

I'm pretty sure this is the standard defline format for fastq-dump, which, if indeed true, hardly makes it non-standard. That is going to be the preferred/suggested method for many to download data from the SRA. I typically define a custom defline format when I run fastq-dump, but we probably want to add support for this. Shouldn't be too much work. :-)

standage avatar Aug 01 '17 18:08 standage

@standage Fastq dump! Thanks for the script.

brooksph avatar Aug 01 '17 18:08 brooksph

Oops, my fault. The IDs from @brooksph's example have 3 period-separated values:

  • X: accession
  • Y: serial number
  • Z: pairing info
SRR606249.9162938.2
    |        |    |
|---X---| |--Y--| Z

The default output of fastq-dump only has the first 2.

$ fastq-dump --split-files -Z SRR606249 | head -n 20
@SRR606249.1 HWI-ST829:109:D073EACXX:2:1101:1249:2134 length=101
GCAANNNNNNAACTTCCTCNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR606249.1 HWI-ST829:109:D073EACXX:2:1101:1249:2134 length=101
<<<@######22@@@?@?@##################################################################################
@SRR606249.1 HWI-ST829:109:D073EACXX:2:1101:1249:2134 length=101
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCAGCCTANNNNNNNNNNNNNNNNNNNCNTNNNNNNNNNNNNNNNNNNNN
+SRR606249.1 HWI-ST829:109:D073EACXX:2:1101:1249:2134 length=101
#####################################################################################################
@SRR606249.2 HWI-ST829:109:D073EACXX:2:1101:1165:2139 length=101
GTCNNNNNNAATTGGACGANNNNNNNNNNNNNNNNNATNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR606249.2 HWI-ST829:109:D073EACXX:2:1101:1165:2139 length=101
<<<######22@@@@??@?##################################################################################
@SRR606249.2 HWI-ST829:109:D073EACXX:2:1101:1165:2139 length=101
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGAGGGCAGNNNNNNNNNNNNNNNNNNNCNCNNNNNNNNNNNNNNNNNNNN
+SRR606249.2 HWI-ST829:109:D073EACXX:2:1101:1165:2139 length=101
#####################################################################################################
@SRR606249.3 HWI-ST829:109:D073EACXX:2:1101:1186:2164 length=101
TGATNNNNNTGTCTCACAGTNNNNNNNNNNNNNTATAANNNNNNNNNNNNNNNTAAAGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR606249.3 HWI-ST829:109:D073EACXX:2:1101:1186:2164 length=101
<<<@#####42@?@@?@?@@#################################################################################

So @titus was correct about the .1 and .2 suffixes being non-standard. Do you know how/when these got added in @brooksph?

standage avatar Aug 01 '17 18:08 standage