HiCUP
NlaIII digestion problem
We used the NlaIII enzyme for digestion in our Hi-C experiment. I specified --re1 CATG^,NlaIII when running hicup_digester, and the resulting file looks fine; here is the head of the file.
But when I run HiCUP with it, I get no results. From the log, I found that no sequence is listed inside the [].
Truncating with HiCUP Truncater v0.7.4
Truncating sequences at occurrence of sequences '[]'
Truncating sequences
I had the same issue, and I think it's because you also need to provide the dangling sequence on the other side of the caret:
https://en.wikipedia.org/wiki/NlaIII
So I got something to print when I used --re1 CATG^CATG,NlaIII:
Truncating with HiCUP Truncater v0.8.3
Truncating sequences at occurrence of sequences '[CATGCATG]'
Truncating sequences
Truncating R1_fq.gz
Truncating R2_fq.gz
Edit: ignore this comment and see the updated issue below.
After more playing around, I realise that "CATG^" should actually work, and that the sequence being looked for should just be "CATG", not "CATGCATG", "CATGGTAC", or anything else.
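To make the expected behaviour concrete, here is a minimal Python sketch of how a --re1 value such as CATG^,NlaIII could be turned into the sequence to search for. This is not HiCUP's actual code (HiCUP is written in Perl); the function name `parse_re1` and its error handling are assumptions made purely for illustration. It simply drops the caret, which matches the '[CATGGTAC]' and '[CATGCATG]' log lines seen in this thread.

```python
def parse_re1(spec):
    """Split an --re1 style value into (search_sequence, enzyme_name).

    Hypothetical helper for illustration only; not part of HiCUP.
    """
    # "CATG^,NlaIII" -> site "CATG^", name "NlaIII" (name may be absent)
    site, _, name = spec.partition(",")
    if site.count("^") != 1:
        raise ValueError(f"expected exactly one caret in {site!r}")
    # The sequence to look for is the site with the cut marker removed.
    sequence = site.replace("^", "").upper()
    if not sequence:
        # Fail loudly instead of silently searching for an empty pattern.
        raise ValueError("restriction site is empty once the caret is removed")
    return sequence, name


print(parse_re1("CATG^,NlaIII"))  # ('CATG', 'NlaIII')
```

Raising an error on an empty sequence is a design choice of this sketch: it would surface a '[]' situation immediately rather than letting truncation run with nothing to search for.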
Currently, the reads in the truncated file are cut incorrectly.
Take the following reads, which I'll save in a file called test1.fq:
@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTACTGGTATAGAGACAGACAAGTAGACCAATGGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF,F
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATGGTACCTGACAGTTTAAAATTGAAACAAACAAAGACACAAGGAAGTGTGGGTGGGGT
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATGGTACCTGAAGCATATCTTGCAATTTACAATCATATACAGAAATTAAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFF:FFFFFFFFFFFF:FFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATGGTACCTTATCC
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,,FF,FF::FF:
How CATG^GTAC is truncated:
rm -rf test_dir; mkdir test;
hicup_truncater --re1 "CATG^GTAC" test1.fq test1.fq ## just write it twice for testing
yields:
"Truncating sequences at occurrence of sequences '[CATGGTAC]'"
@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATGGTAC
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATGGTAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATGGTAC
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,,FF,
Note that each read containing "CATGGTAC" is cut so that the sequence ends with it.
How CATG^ is truncated:
rm -rf test_dir; mkdir test;
hicup_truncater --re1 "CATG^" test1.fq test1.fq
yields:
"Truncating sequences at occurrence of sequences '[]'"
@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ACATG
+
FFF,,
Note how each read has basically just vanished. This is wrong.
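A quick way to spot this kind of failure is to look at read lengths after truncation. The following Python sketch is only a rough sanity check, not anything HiCUP provides: the helper names `read_lengths` and `looks_collapsed` are hypothetical, the 20-base threshold is an arbitrary assumption, and the parser assumes plain 4-line FASTQ records.

```python
def read_lengths(fastq_text):
    """Return the sequence length of each record in 4-line FASTQ text."""
    lines = fastq_text.strip().splitlines()
    # The sequence is the second line of every 4-line record.
    return [len(lines[i]) for i in range(1, len(lines), 4)]


def looks_collapsed(lengths, min_length=20):
    """Flag output in which every read is shorter than min_length bases.

    min_length is an arbitrary threshold chosen for this sketch.
    """
    return bool(lengths) and all(n < min_length for n in lengths)


# Two records shaped like the broken output above (read names shortened).
truncated = "@r1\nACATG\n+\nFFFFF\n@r2\nACATG\n+\nFFF,,\n"
print(read_lengths(truncated))                   # [5, 5]
print(looks_collapsed(read_lengths(truncated)))  # True
```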
What CATG^ should produce:
rm -rf test_dir; mkdir test;
hicup_truncater --re1 "CATG^" test1.fq test1.fq
"Truncating sequences at occurrence of sequences '[CATG]'"
@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATG
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATG
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,
Note how each read is truncated so that it ends with the desired sequence.
This fix is implemented in the PR above.
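The expected behaviour can be written down as a short Python sketch: cut each read immediately after the first occurrence of the search sequence and trim the quality string to the same length. This only mirrors the outputs shown in this thread; it is not HiCUP's implementation, `truncate_read` is a hypothetical name, and leaving reads without the site unchanged is an assumption.

```python
def truncate_read(seq, qual, site):
    """Cut a read just after the first occurrence of `site`.

    Hypothetical helper, not HiCUP's implementation. Reads that do not
    contain the site are returned unchanged (an assumption of this sketch).
    """
    if not site:
        raise ValueError("site must be a non-empty sequence")
    start = seq.find(site)
    if start == -1:
        return seq, qual
    end = start + len(site)
    # Keep sequence and quality strings the same length.
    return seq[:end], qual[:end]


# First read from test1.fq; qualities are placeholders, not the real ones.
seq = ("ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTACTGG"
       "TATAGAGACAGACAAGTAGACCAATGGACT")
qual = "F" * len(seq)

short_seq, short_qual = truncate_read(seq, qual, "CATG")
print(short_seq)  # ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATG
```

Calling the same function with "CATGGTAC" instead of "CATG" gives a read ending in CATGGTAC, as in the CATG^GTAC example above.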