derep_fasta_hash.py appears to truncate fasta headers at 45 characters

Open invertdna opened this issue 8 years ago • 6 comments

  1. I love your work. And your stylish Polynesian beverages.

  2. The new hash option for dereplicating is working great -- esp. useful for combining datasets. But...

  3. the hashes are long enough to get truncated by the script somehow. This wasn't a problem in the non-hash version of the derep script: you don’t notice when the header is OTU1, for example, because it’s short. But in my current pipeline output, dups_to_otus has untruncated fasta headers, and derep.map has truncated headers, and so the two don't match up when it comes time to create an OTU.mapfile, for example. Any chance you know why this happens? I don't speak enough python to quite figure it out.

invertdna avatar Nov 28 '17 23:11 invertdna

Weird. Is this the first time using the hashing option for dereplication? Or has it been used successfully, and now it's breaking?

jimmyodonnell avatar Nov 29 '17 13:11 jimmyodonnell

No, it was that way before, too; it’s only become a problem now (I was working around it earlier). It’s not that big a deal, but it would be good to deal with if it’s an easy fix.

Example:

derep.map [showing truncation]

SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4	ID1=Lib_F;ID2A=ATATCG;ID2B=CGATAT	45333
SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4	ID1=Lib_F;ID2A=CTCGCA;ID2B=TGCGAG	32859

OTUs.fasta [no truncation]

SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=1151502;
aaaaagatgctgaaaaagaacaggatctccaccaccatggctgtcaaagaaagcggtgttgaaatttctgtctgttagaagcatagtaattgctcctgctaaaacaggtaacgataataagagtaaaaaagctgtaataaaaacagctcaaacaaataaaggtaccctatgtgctgacataccgggtgctctcatgtttaaaatcgttacaataaaatttatcgctcctaaaatagaagaagcacctgcaatgtgtaaactaaaaattgcaagatctacagatcctccagaatgtgctaggattccacttaaa
SHA1=525d7b3452e357ad771ff3fe71311c95e5ed9a1f;size=1053571;
aaacaaatgttgatataatactgggtcacctccaccattagggtcgaagaatgacgtattaaagttacgatctgttaaaagcattgtaatagctcctgctagaaccggtaaagacaataaaagtaagaaagctgtaataaatacagctcatacaaataaaggagttcggtgagcagtcattcctggagctcgcatgttcataattgttacaataaaattaattgcaccaagaatcgacgaaactccggctaaatgaagagagaaaattgctaaatcaacagaacctcctgaatgagcttgaatccctgctaga

dups_to_otus.csv [no truncation]

Query,Match
SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598,SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598
SHA1=33a3d43fe71f820b1b0fea0053b7c8888fa4e98b;size=4224,SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598

invertdna avatar Nov 29 '17 16:11 invertdna

It doesn't look like it's truncating because of length; it just doesn't include ;size=***.
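As a side note on where the apparent 45-character cutoff comes from: a SHA-1 hex digest is always 40 characters, so a label of the form SHA1=<digest> is exactly 45 characters once the ;size= annotation is absent. A minimal sketch follows; the function name `hashed_header` and the choice to hash the raw lowercase sequence string are assumptions for illustration, not necessarily what derep_fasta_hash.py does.

```python
import hashlib


def hashed_header(sequence: str) -> str:
    """Build a 'SHA1=<hexdigest>' label (hypothetical helper, assumed scheme)."""
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()
    return "SHA1=" + digest


header = hashed_header("aaaaagatgctgaaaaagaacaggatctcc")

# 5 characters for 'SHA1=' plus 40 hex digits, whatever the sequence is.
print(len(header))

# Appending a ';size=N' annotation makes the header longer, so a file without
# the annotation looks like it was cut off at character 45.
annotated = header + ";size=1151502"
print(annotated[:45] == header)
```

The same arithmetic holds for the IDs quoted in this thread: SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4 is 45 characters long.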

When I run bash banzai.sh test without hashing, I get a dups_to_otus.csv that looks like this:

Query,Match
DUP_1,DUP_1
DUP_22,DUP_1
DUP_40,DUP_1
DUP_59,DUP_1
DUP_60,DUP_1

and when I run it with hashing, I get this:

Query,Match
SHA1=432ed9add9b91f058aba9930ed2119b83be066b1,SHA1=432ed9add9b91f058aba9930ed2119b83be066b1
SHA1=57baf2651a3c1bedc4fe2d913b7967a421afb7bf,SHA1=432ed9add9b91f058aba9930ed2119b83be066b1
SHA1=e096a61542344503fc6c3d872ff77019165a1da3,SHA1=432ed9add9b91f058aba9930ed2119b83be066b1
SHA1=9443e1324191a63d5b8d4166619b922ad90a8432,SHA1=432ed9add9b91f058aba9930ed2119b83be066b1
SHA1=a6d5b6877016bab0c48f274b1ada16e1ac72fda2,SHA1=432ed9add9b91f058aba9930ed2119b83be066b1

There is no ;size= info in either file. Are you using a modified version of banzai?

jimmyodonnell avatar Nov 29 '17 17:11 jimmyodonnell

Hmm, I thought I looked at that a few months ago; let me take another look. Sorry for the false alarm. Trying to figure out how to get an OTU.map (the R script breaks at the merge() function in trying to merge the dups_to_otus and the derep.map, so if it’s not truncation, I guess it’s because one has the “;size=” in it and the other doesn’t).
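If the mismatch really is the ;size= annotation, one possible workaround is to strip it from the IDs before joining the two tables. The sketch below is in Python rather than the pipeline's R script, and the names `strip_size`, `dups_to_otus`, `derep_map` and `otu_map` are hypothetical; the rows are copied from the example earlier in this thread.

```python
import re

# Matches a trailing ';size=<digits>' annotation, with or without a final ';'.
SIZE_SUFFIX = re.compile(r";size=\d+;?$")


def strip_size(header: str) -> str:
    """Return the header without its trailing ';size=N' annotation."""
    return SIZE_SUFFIX.sub("", header)


# (Query, Match) rows as they appear in dups_to_otus.csv, with ';size='.
dups_to_otus = [
    ("SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598",
     "SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598"),
    ("SHA1=33a3d43fe71f820b1b0fea0053b7c8888fa4e98b;size=4224",
     "SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4;size=676598"),
]

# (sequence ID, sample, count) rows as they appear in derep.map, without ';size='.
derep_map = [
    ("SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4",
     "ID1=Lib_F;ID2A=ATATCG;ID2B=CGATAT", 45333),
    ("SHA1=d9187e8a6c7b2f0702404a6d373daf455f4e6ee4",
     "ID1=Lib_F;ID2A=CTCGCA;ID2B=TGCGAG", 32859),
]

# Normalise both columns so the keys match the IDs used in derep.map.
otu_of = {strip_size(query): strip_size(match) for query, match in dups_to_otus}

# Join: replace each sequence ID with the OTU it was assigned to.
otu_map = [
    (otu_of[seq_id], sample, count)
    for seq_id, sample, count in derep_map
    if seq_id in otu_of
]

for row in otu_map:
    print("\t".join(str(field) for field in row))
```

Headers without the annotation, such as DUP_1, pass through `strip_size` unchanged, so the same normalisation should be harmless for unhashed runs.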

invertdna avatar Nov 29 '17 17:11 invertdna

The file OTU.map looks as expected for both unhashed and hashed sequence IDs for me (below). I take it you're using another version of at least some component of this?

DUP_1	ID1=A;ID2A=ATCAGT;ID2B=ACTGAT	28
DUP_1	ID1=E;ID2A=ATCAGT;ID2B=ACTGAT	22
DUP_1	ID1=D;ID2A=TGTATG;ID2B=CATACA	21
DUP_1	ID1=A;ID2A=TGTATG;ID2B=CATACA	25
DUP_1	ID1=B;ID2A=TACGTG;ID2B=CACGTA	20
SHA1=0a055f4b00298e326bcf5cad348bc2b5a1c05d5a	ID1=F;ID2A=TCTGCG;ID2B=CGCAGA	1
SHA1=0a055f4b00298e326bcf5cad348bc2b5a1c05d5a	ID1=F;ID2A=ACGACG;ID2B=CGTCGT	1
SHA1=0a055f4b00298e326bcf5cad348bc2b5a1c05d5a	ID1=K;ID2A=TCTGCG;ID2B=CGCAGA	1
SHA1=0c7e1bbe801c6b2c14e802a4422fc775a99baf36	ID1=F;ID2A=ATCAGT;ID2B=TACTGA	1
SHA1=0c7e1bbe801c6b2c14e802a4422fc775a99baf36	ID1=D;ID2A=ATATCG;ID2B=CGATAT	1

jimmyodonnell avatar Nov 29 '17 17:11 jimmyodonnell

I didn’t think so, but will re-download and try again. Sorry!

invertdna avatar Nov 29 '17 17:11 invertdna