khmer icon indicating copy to clipboard operation
khmer copied to clipboard

interleave-reads.py fails with differing metadata for read pairs

Open fungs opened this issue 6 years ago • 8 comments

The script, if fed with read pairs which look like this

read1 length=99/1
read1 length=101/2

fails. This should be fixed by looking at the first part of the identifier separated by whitespace, which seems to be common Illumina format.

Message: ERROR: This doesn't look like paired data! Version: 2.1.2 Executable: interleave-reads.py

Best, Johannes

fungs avatar Apr 12 '18 17:04 fungs

Thanks for the report!

The issue isn't the read lengths themselves, its the read IDs. At the moment khmer only recognizes paired reads if the entire read ID (including stuff after the first whitespace) is identical up to the /1 or /2 suffix. I'm not convinced this is a bug.

This isn't the only pairing format that khmer supports, but if you are relying on the /1 and /2 suffixes then everything prior to this suffix must be identical.

standage avatar Apr 12 '18 18:04 standage

Well, I know there are different naming schemes but if you implement sanity checking for Illumina-style naming you should know that many post-processing tools append key-value tags to the read names but the part before the first whitespace is always unique and the rest is just metainfo. I'm not aware of a program creating identifiers which include whitespaces. So it would be much more robust to the if the first part matches instead of simply cutting of the /1 and /2 parts. Or you can argue with Brian Bushnell over at BBMap whether adjusting the length= tag to the new length of the read really is a bug (though my gut feeling tells me that not adjusting it would be wrong).

The length=xx tag is standard when you pull sequences from NCBI SRA using fastq-dump, this is where my reads came from.

This is just my 2 cents, I have my own working alternative and BBMap has its own which I guess works just fine (reformat.sh), I was just giving the khmer scripts a try.

Best, Johannes

fungs avatar Apr 12 '18 18:04 fungs

One of the alternative conventions khmer supports (or is supposed to support) is @name 1:blah blah blah and @name 2:blah blah blah, which is compatible with appending metadata to the defline ad infinitum.

standage avatar Apr 12 '18 20:04 standage

That doesn't change the checking rule I mentioned. You don't really need to check 1 vs 2, just make sure the identifiers before the whitespace match.

fungs avatar Apr 12 '18 20:04 fungs

I'll poke some of my colleagues and see if anyone else wants to chime in. As a new user of khmer, I remember the handling of paired reads as one of my first pain points. It should be fairly straightforward to support any number of reasonable conventions as long as they're not too obscure or complex.

standage avatar Apr 12 '18 20:04 standage

I used to have a similar problem when I got Fastq files from SRA using fastq-dump. They have an option to append .1 and .2 to the @name part of the headers but this is still not accepted by our parser. I had to edit them to /1 and /2. I think we can have an option like force_paired that does not check for broken reads and just believe that reads are paired.

drtamermansour avatar Apr 12 '18 22:04 drtamermansour

note that for the 2.x series, we can only add command line flags (ref Semantic Versioning) to adjust behavior. I think @drtamermansour proposal is good for that.

however, for 3.x, we could completely change the behavior, and for that I would propose by default checking that the identifier before the white space is correct and ignoring everything after that and ripping out anything else that checks for proper pairing. There are too many conventions to support to do otherwise and there's a major speed consideration also.

thoughts?

ctb avatar Apr 13 '18 13:04 ctb

@ctb: sounds reasonable to me. It's rather unlikely that two reads in the forward and reverse files have the same identifier and index but should not be properly paired.

If you want to go nuts on format checking, you could have a command line option for strong consistency checking which understands some formats and might check for uniqueness of identifiers as well. However, following the UNIX one-program-per-task paradigm that should be in a separate program.

fungs avatar Apr 13 '18 13:04 fungs