
NlaIII digestion problem

Open StevenWingett opened this issue 5 years ago • 3 comments

We used the NlaIII enzyme for digestion in our Hi-C experiment. I specified --re1 CATG^,NlaIII when running hicup_digester, and the resulting file looks fine; here is the head of the file.

But when I run HiCUP with it, I get no results. The log shows that the truncater has no sequence to search for: the brackets [] are empty.

Truncating with HiCUP Truncater v0.7.4
Truncating sequences at occurrence of sequences '[]'
Truncating sequences

StevenWingett avatar Sep 29 '20 16:09 StevenWingett

I had the same issue, and I think it's because you also need to provide the dangling sequence on the other side of the caret:

https://en.wikipedia.org/wiki/NlaIII

With --re1 CATG^CATG,NlaIII the log at least reports a sequence to search for:

Truncating with HiCUP Truncater v0.8.3
Truncating sequences at occurrence of sequences '[CATGCATG]'
Truncating sequences
Truncating R1_fq.gz
Truncating R2_fq.gz

Edit: ignore this comment, see updated issue below

mtekman avatar Dec 08 '23 12:12 mtekman

After more playing around, I realise that "CATG^" should actually work, and that the sequence being searched for should just be "CATG", not "CATGCATG", "CATGGTAC", or anything else.

Currently the output file is truncated completely incorrectly.

Take the following reads, in a file I'll call test1.fq:

@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTACTGGTATAGAGACAGACAAGTAGACCAATGGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF,F
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATGGTACCTGACAGTTTAAAATTGAAACAAACAAAGACACAAGGAAGTGTGGGTGGGGT
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATGGTACCTGAAGCATATCTTGCAATTTACAATCATATACAGAAATTAAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFF:FFFFFFFFFFFF:FFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATGGTACCTTATCC
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,,FF,FF::FF:

How CATG^GTAC is truncated

rm -rf test_dir; mkdir test_dir;
hicup_truncater --re1 "CATG^GTAC" test1.fq test1.fq  ## same file passed twice, just for testing

yields:

"Truncating sequences at occurrence of sequences '[CATGGTAC]'"

@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATGGTAC
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATGGTAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATGGTAC
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,,FF,

Note that each read containing "CATGGTAC" is cut so that the sequence ends with it.

How CATG^ is truncated:

rm -rf test_dir; mkdir test_dir;
hicup_truncater --re1 "CATG^"  test1.fq test1.fq

yields:

"Truncating sequences at occurrence of sequences '[]'"

@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
ACATG
+
FFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ACATG
+
FFF,,

Note how each read has basically vanished: all that is left is five bases, "ACATG". This is wrong.
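For what it's worth, the three bracketed sequences logged in this thread ('[CATGGTAC]', '[CATGCATG]' and '[]') are all consistent with a simple fill-in style rule: with k bases before the caret, drop the last k bases of the site for the left half and the first k bases for the right half, then join the halves. The sketch below is only my assumption about such a rule, not code taken from HiCUP, and `fill_in_junction` is a hypothetical helper name.

```python
def fill_in_junction(re_site: str) -> str:
    """Hypothetical model of a ligation junction built from a caret-marked site.

    Assumption (not taken from the HiCUP source): with k bases before the
    caret, the left half is the site minus its last k bases and the right
    half is the site minus its first k bases.
    """
    before, after = re_site.split("^")  # expects exactly one caret
    site = before + after
    k = len(before)
    return site[: len(site) - k] + site[k:]


# The three --re1 values tried in this thread
for re_site in ("CATG^GTAC", "CATG^CATG", "CATG^"):
    print(f"{re_site} -> [{fill_in_junction(re_site)}]")
```

Under this model "CATG^" leaves nothing on either side of the cut, so the junction comes out empty, which would explain the '[]' in the log.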

What CATG^ should be producing

rm -rf test_dir; mkdir test_dir;
hicup_truncater --re1 "CATG^"  test1.fq test1.fq

"Truncating sequences at occurrence of sequences '[CATG]'"

@A00627:719:H7LLYDSX7:3:1101:20003:4914 1:N:0:ATCACG
ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:19208:8202 1:N:0:ATCACG
AGAAAGAAAGAAAGAAAGAAACTCGTTTCTCTGAGATGTAGGCCATG
+
FFFFFFFFFFF:FFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:4616:9267 1:N:0:ATCACG
AGCTACAAGGTCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAATGAATATGAATCATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF
@A00627:719:H7LLYDSX7:3:1101:9263:9580 1:N:0:ATCACG
ATACGTAGCCCAAGCTAGCTACAATCTCAAGATCCTCCTGCTTCAGCCTCCTGGGTGCTAGGATTACAGGCATG
+
FFF,,FF,F,:FF,FFFFFF,,FFFFFFFFFFF,:F,FFF,:F:F:,FF,:F:FFFFFF:F,F:,:F:,F,,F,

Note how each read is truncated but ends with the desired sequence.
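To make the expected behaviour concrete, here is a minimal sketch of the truncation I would expect: cut each read just after the first occurrence of the search sequence, and cut the quality string to the same length. `truncate_at` is a hypothetical helper written for illustration, not HiCUP's implementation, and the quality string is a placeholder of the right length rather than the real one from test1.fq.

```python
from typing import Tuple


def truncate_at(seq: str, qual: str, motif: str) -> Tuple[str, str]:
    """Cut a read right after the first occurrence of `motif`.

    Reads without the motif are returned unchanged. An empty motif is
    rejected, because it would "match" at the very start of every read.
    """
    if not motif:
        raise ValueError("search sequence must not be empty")
    pos = seq.find(motif)
    if pos == -1:
        return seq, qual
    end = pos + len(motif)
    return seq[:end], qual[:end]


# First read of test1.fq; qualities are a placeholder of matching length
seq = "ACCTAAAGCTTTACTACAGAGCAATTGTGATAAAAACTGCATGGTACTGGTATAGAGACAGACAAGTAGACCAATGGACT"
qual = "F" * len(seq)

trimmed_seq, trimmed_qual = truncate_at(seq, qual, "CATG")
print(trimmed_seq)
print(trimmed_qual)
```

For this read the result ends in "...CTGCATG", the same as the first record of the expected output above. Rejecting an empty search sequence up front also turns the '[]' case into a loud error rather than a silently mangled file.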

mtekman avatar Jan 30 '24 13:01 mtekman

This fix is implemented in the PR above.

mtekman avatar Jan 30 '24 13:01 mtekman