diamond icon indicating copy to clipboard operation
diamond copied to clipboard

DNA SAM output

Open terrycojones opened this issue 6 years ago • 7 comments

Hi again Benjamin. Today I git pulled down the latest & saw the qqual output specifier - thanks again!

I'm wondering if you've considered supporting SAM output that has the input DNA sequences in it instead of the AA translations. I just ran with --outfmt 101 for the first time and was surprised to see the AA in there. I can understand why of course, and you have my sympathies for what must seem like continually being torn between the two worlds (e.g., what offsets should you emit, etc). I see the ZS tag in the SAM btw. Also I'm sure you know that putting AA into SAM isn't really supposed to be done, I think solely because you can't then convert that to BAM (I might have even learned that here in the comments on another ticket - not sure).

Anyway, I'm probably just going to do this myself as it's not very hard and I can see you have many people asking you things & probably have many other things to do. But I'm curious if you've thought about it or plan to do it?

Once I do it (talking myself into it here...) I'l comment back here and let others do the same via post-processing your output. I mainly only need to write a btop to CIGAR converter. Your qseq output attribute only gives the aligned part of the query so I'd probably just start/end the CIGAR string with hard-clipping when needed. OK, thinking out loud now, so I'll stop :-)

terrycojones avatar Nov 21 '18 23:11 terrycojones

In case this is of help to anyone, I wrote a BTOP to CIGAR converter in Python. See https://github.com/acorg/dark-matter/blob/master/dark/btop.py for the code and https://github.com/acorg/dark-matter/blob/master/test/test_btop.py for the tests. Could still be wrong but at least some not entirely trivial tests pass.

terrycojones avatar Nov 22 '18 16:11 terrycojones

There is now a diamond-to-sam.py script in the https://github.com/acorg/dark-matter package. I'm not 100% sure it's correct, but I would like to think so.... It differs from the DIAMOND --outfmt 101 output in that the sequences are DNA, they're reverse complimented when needed (with a reversed quality string, if present), the SAM flag is set to 16 if reverse complemented, the CIGAR string uses the more informative =/X indicators rather than just M. It currently only produces the AS extra SAM tag (DIAMOND produces quite a few more extra tags).

Thanks for the help @bbuchfink (the qqual field makes producing this much easier). This ticket could be closed now. I mainly opened it to see what you think and also to get it into search engines so people can find their way to an answer.

terrycojones avatar Nov 23 '18 14:11 terrycojones

BTW, is there a way to contribute to the documentation? It takes a long time to figure some things out. I could add them to the docs. E.g., I just discovered that the BTOP string is amino-acid based, so to make a DNA CIGAR string you have to multiply numbers by 3 and you have to use M (not X/=). And I worked out (the hard way) that the qseq sequence emitted in --outfmt 6 is already hard-clipped. There are other little things, like the SAM FLAG always being 0 even if the DNA sequence was reversed, etc. Anyway, maybe I could help to clear some things up.

terrycojones avatar Nov 23 '18 16:11 terrycojones

Hi Terry, I just added the latex sources to the repository under doc/, feel free to edit it and submit a pull request. Thanks for all your other feedback, I'll definitely look into it.

bbuchfink avatar Nov 24 '18 08:11 bbuchfink

Hi Benjamin. Great, thanks. Before I do anything, a couple of high-level questions. Do you want to keep making the docs via LaTeX -> PDF? I ask because it might make sense for the longer term and ease of editing / contributions to turn them into markdown. I'm not looking for extra things to do really, but it would be pretty easy to make a first version in markdown & see what it looks like. I'm also fine with LaTeX - I spent many years using it. Second, in either case, it will make submitting proposed changes simpler (on you) if we reformat to have shorter lines. That will make diffs much easier to read. You might be interested to read https://rhodesmill.org/brandon/2012/one-sentence-per-line/ So if you're happy for some initial reformatting, I'd make a pr which has no changes other than making shorter lines. If you were happy with that it would be an easy thing to merge (with a big diff, but not one needing much checking as no content would change). Thereafter we'd have small diffs.

Looks like you're a PhD student in Tübingen, is that still right? I'm in the process of moving to Berlin (institute of virology at Charité). I should organize a workshop or something & invite you to visit.

terrycojones avatar Nov 24 '18 14:11 terrycojones

I have been wanting to move to Markdown for some time actually, may as well do it now. I have converted the latex to markdown using pandoc and put it here: https://github.com/bbuchfink/diamond/wiki/Manual-Markdown-version You should be able to edit that. Not sure about the shorter lines but we can give it a try.

I'm still in Tübingen at MPI, and I'd sure love to visit Berlin some time.

bbuchfink avatar Nov 24 '18 14:11 bbuchfink

Back to your questions about the SAM format, it was pretty much only included for importing into MEGAN, which required the format to be like that. Looks like you already made a converter for an improved format, but of course it would also be possible to build these improvements into Diamond if need be (probably need to make them optional for backward compatibility).

bbuchfink avatar Nov 28 '18 12:11 bbuchfink