Differences in counts compared to MIXCR results, and out-of-frame CDR3 handling

Open dcarbajo opened this issue 1 year ago • 8 comments

Hello again! This is a follow-up to issue #247, thank you so much for your insights there!

So I managed to run TRUST4 on my SMARTer data with the following command (it was still very slow, taking about half a day per sample):

run-trust4 --barcodeLevel molecule \
                   -f path_to/hg38_bcrtcr.fa \
                   --ref path_to/human_IMGT+C.fa \
                   -1 path_to/sample_fq1 \
                   -2 path_to/sample_fq2 \
                   --barcode path_to/sample_fq2 \
                   --readFormat bc:0:11,r2:20:-1,r1:28:-1 \
                   --repseq -o sample_name --od sample_output_dir -t 8

But now that I compare one sample's results with those previously obtained with MIXCR for the same sample, I see some discrepancies that I was hoping you could help me understand.

At a glance, what strikes me most is the number of clonotype entries in the TRUST4 report compared to the MIXCR one. The MIXCR file has 4320 lines, while the TRUST4 one has 82024. Filtering to TRA entries only brings this down to 27637 lines, and to 6423 after removing singleton clonotypes (count=1), which is still over 2000 more clonotypes than MIXCR reports.
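
For reference, the TRA-only and non-singleton filtering described above could be sketched roughly as follows. This is only an illustration, not necessarily how the numbers above were produced: it assumes the report is tab-separated with a header containing `#count`, `V`, `J` and `C` columns and that unassigned genes are written as `*`. The function names and the toy rows are hypothetical.

```python
import csv
import io


def read_report(handle):
    """Parse a tab-separated report whose header has a '#count' column (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        row["count"] = int(row.pop("#count"))
        rows.append(row)
    return rows


def chain_of(row):
    """Return the 3-letter chain (e.g. 'TRA') taken from the first assigned gene."""
    for key in ("V", "J", "C"):
        gene = row.get(key) or "*"
        if gene != "*":  # '*' is assumed to mean "no gene assigned"
            return gene[:3]
    return None


def filter_clonotypes(rows, chain="TRA", min_count=1):
    """Keep rows of the requested chain with at least min_count supporting reads."""
    return [r for r in rows if chain_of(r) == chain and r["count"] >= min_count]


# Toy data for illustration only; these are not rows from the real report.
TOY_REPORT = (
    "#count\tV\tJ\tC\n"
    "12\tTRAV21*01\tTRAJ31*01\tTRAC\n"
    "1\tTRAV1-2*01\tTRAJ33*01\tTRAC\n"
    "7\tTRBV7-9*01\tTRBJ2-1*01\tTRBC2\n"
    "3\t*\tTRAJ31*01\tTRAC\n"
)

rows = read_report(io.StringIO(TOY_REPORT))
tra_all = filter_clonotypes(rows, chain="TRA", min_count=1)
tra_no_singletons = filter_clonotypes(rows, chain="TRA", min_count=2)
print(len(tra_all), len(tra_no_singletons))
```

Taking the chain from the first assigned gene, rather than from the V gene alone, keeps entries where only the J or C gene was called.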

The counts also seem quite different; see, for example, the top clonotype (TRAV-21 / TRAJ31), which has a count of 361655 in MIXCR:

Screenshot 2024-02-20 at 13 26 14

The same clonotype in the TRUST4 report, although still at the top, has a count of just 7290, roughly 50 times lower:

Screenshot 2024-02-20 at 13 28 00

So there are a lot more clonotypes in the TRUST4 report than in the MIXCR one, but I also wanted to see which clonotypes found by MIXCR were not recovered by TRUST4.

What I observed is that most of these cases contain a CDR3 sequence with one or more gaps in MIXCR, which might be due to an out-of-frame CDR3. Each of these cases is a single line in the MIXCR output but several lines in the TRUST4 one.
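
One way to test the out-of-frame hypothesis is to look at the CDR3 nucleotide length: a CDR3 can only be in frame if its length is a multiple of 3. Below is a minimal sketch of that check. The function names and the example sequences are made up for illustration and do not come from either tool's output.

```python
def is_in_frame(cdr3_nt):
    """A CDR3 nucleotide sequence is in frame only if its length is a non-zero multiple of 3."""
    return len(cdr3_nt) > 0 and len(cdr3_nt) % 3 == 0


def split_by_frame(cdr3_nt_sequences):
    """Partition sequences into (in_frame, out_of_frame) lists, preserving input order."""
    in_frame, out_of_frame = [], []
    for seq in cdr3_nt_sequences:
        (in_frame if is_in_frame(seq) else out_of_frame).append(seq)
    return in_frame, out_of_frame


# Made-up sequences for illustration only.
example = ["TGTGCTGTGAGAGATTTC", "TGTGCTGTGAGAGATTT", "TGTGCTGTGAGAGA"]
in_frame, out_of_frame = split_by_frame(example)
print(len(in_frame), len(out_of_frame))
```

This only checks the length; an in-frame CDR3 can still be non-productive if it contains a stop codon.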

I extracted the V and J genes from these clonotypes with gaps in MIXCR and subset both outputs for a few examples. In the example below, the subset is just one line in the MIXCR output:

Screenshot 2024-02-20 at 13 35 50

it becomes several different lines in the TRUST4 output:

Screenshot 2024-02-20 at 13 36 09

Strangely, all these CDR3 sequences are quite different, and some do not look like real CDR3s (like FEASIRDENIIF above), which concerns me a bit. Most of these entries are singleton clonotypes, but not all (the top 4 lines have count>1).

I was wondering how to interpret this, and whether there is some aggregation or filtering that I should do downstream of TRUST4 to make the results more comprehensible (and comparable to the results I previously obtained with MIXCR).
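
To make concrete the kind of aggregation I have in mind, here is a rough sketch that sums counts per V/J gene pair after dropping allele suffixes, so that several TRUST4 lines sharing the same V and J collapse into one entry. It assumes alleles are written as `GENE*NN` and that `*` alone means an unassigned gene. The function names and toy tuples are hypothetical, and I am not claiming this is the recommended way to compare the two tools.

```python
from collections import defaultdict


def strip_allele(gene):
    """Drop an allele suffix such as '*01'; a bare '*' (assumed 'unassigned') is kept as-is."""
    if gene == "*":
        return gene
    return gene.split("*")[0]


def aggregate_by_vj(records):
    """Sum counts over (V gene, J gene) pairs; records are (v, j, count) tuples."""
    totals = defaultdict(int)
    for v, j, count in records:
        totals[(strip_allele(v), strip_allele(j))] += count
    return dict(totals)


# Toy records for illustration only; these are not values from the real reports.
toy = [
    ("TRAV21*01", "TRAJ31*01", 5),
    ("TRAV21*02", "TRAJ31*01", 3),
    ("TRAV1-2*01", "TRAJ33*01", 2),
    ("*", "TRAJ33*01", 1),
]
for (v, j), total in sorted(aggregate_by_vj(toy).items()):
    print(v, j, total)
```

Grouping by V/J alone ignores the CDR3 sequence, so it would only serve as a coarse comparison and could merge genuinely distinct clonotypes.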

Many thanks again!

dcarbajo, Feb 20 '24 05:02