
How is `cid_full_length` assigned?

Open ejohnson643 opened this issue 9 months ago • 3 comments

I've been trying to understand how the _report.tsv is generated from the _barcode_report.tsv and the _cdr3.out files in the trust-simplerep.pl script. I believe I have correctly determined that the CDR3s counted in the _report.tsv are aggregated across barcodes using the V-D-J-C-CDRnt annotations as a unique key.

This is all fine, but I was checking whether there are CDR3s in the _report.tsv that have the same V-D-J-C and CDR amino acid annotations, to see how different the CDR3s in the file are. As an example:

[Screenshot 2023-10-05 at 1:48:08 PM]

So we can see that there are 5 different CDR3s with identical V, J, and CDRaa, but which differ in their exact nucleotide sequences. However, when I look at whether these CDR3s are "full length," only the first one is, despite them all having nearly identical CDR3 nucleotide sequences:

[Screenshot 2023-10-05 at 1:50:30 PM]
[Screenshot 2023-10-05 at 1:50:59 PM]

Could someone provide some insight into how exactly the "full length" determination is made, and how it could differ between these entries of the _report.tsv? Thank you!
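For anyone who wants to run the same check, here is a minimal Python sketch of the grouping described above. It is not taken from trust-simplerep.pl. It assumes the report header is `#count frequency CDR3nt CDR3aa V D J C cid cid_full_length`, and the rows in `SAMPLE_REPORT` are made up purely for illustration (they are not the rows from the screenshots). The helper name `group_by_aa_key` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only. The column names are an assumption
# about the *_report.tsv header; adjust them if your file differs.
SAMPLE_REPORT = (
    "#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length\n"
    "10\t0.5\tTGTGCCAGCAGTTTC\tCASSF\tTRBV1*01\t.\tTRBJ1-1*01\tTRBC1\tassemble0\t1\n"
    "3\t0.15\tTGTGCCAGCAGCTTC\tCASSF\tTRBV1*01\t.\tTRBJ1-1*01\tTRBC1\tassemble1\t0\n"
    "7\t0.35\tTGTGCCAGCTGGTTC\tCASWF\tTRBV2*01\t.\tTRBJ1-2*01\tTRBC1\tassemble2\t1\n"
)


def group_by_aa_key(handle):
    """Group report rows by (V, D, J, C, CDR3aa).

    Each group lists the (CDR3nt, cid, cid_full_length) of its member rows,
    so nucleotide variants of one amino-acid CDR3 end up side by side.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["V"], row["D"], row["J"], row["C"], row["CDR3aa"])
        groups[key].append((row["CDR3nt"], row["cid"], row["cid_full_length"]))
    return groups


# To read a real file instead: open("sample_report.tsv", newline="")
groups = group_by_aa_key(io.StringIO(SAMPLE_REPORT))

# Keep only groups whose members disagree on the cid_full_length flag.
mixed = {
    key: members
    for key, members in groups.items()
    if len({flag for _, _, flag in members}) > 1
}

for key, members in mixed.items():
    print("\t".join(key))
    for cdr3nt, cid, flag in members:
        print(f"  {cdr3nt}\t{cid}\tfull_length={flag}")
```

Groups that end up in `mixed` are the cases asked about here: the same V-D-J-C-CDR3aa key, different CDR3 nucleotide sequences, and different `cid_full_length` values.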

ejohnson643 avatar Oct 05 '23 17:10 ejohnson643

The full length means the underlying contig contains the full-length receptor variable domain: 5' of V gene to the 3' of J gene. It is more strict than the complete CDR3.

mourisl avatar Oct 05 '23 18:10 mourisl

Thanks for your help!

To make sure I understand: cid_full_length is a property of the contig, not necessarily of the CDR3.

The idea behind including this information in the _report.tsv is that it lets the user know whether the CDR3 was generated from a contig that contains the ends of the V and J genes?

ejohnson643 avatar Oct 05 '23 18:10 ejohnson643

Exactly.

mourisl avatar Oct 05 '23 22:10 mourisl