TRUST4
Differences in counts compared to MIXCR results, and out-of-frame CDR3 handling
Hello again! This is a follow-up to issue #247. Thank you so much for your insights there!
So I managed to run TRUST4 on my SMARTer data with the following command (which still took far too long, about half a day per sample):
run-trust4 --barcodeLevel molecule \
  -f path_to/hg38_bcrtcr.fa \
  --ref path_to/human_IMGT+C.fa \
  -1 path_to/sample_fq1 \
  -2 path_to/sample_fq2 \
  --barcode path_to/sample_fq2 \
  --readFormat bc:0:11,r2:20:-1,r1:28:-1 \
  --repseq -o sample_name --od sample_output_dir -t 8
But now that I compare the results for one sample with those previously obtained with MIXCR for that same sample, I observe some discrepancies that I was hoping you could help me understand.
At a glance, what strikes me most is the number of clonotype entries in the MIXCR report compared to the TRUST4 one. While the MIXCR file has 4320 lines, the TRUST4 one has 82024. After filtering to TRA entries only, I come down to 27637 lines (6423 without singleton clonotypes, i.e. those with count=1, so still over 2000 more clonotypes than MIXCR found).
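For reference, the filtering described above can be sketched roughly as follows. This is only a minimal illustration, assuming the TRUST4 report is a tab-separated file whose header starts with `#count` and includes `V` and `J` gene columns; the rows in `SAMPLE_REPORT` are made up for the example and the helper names `load_rows` and `filter_chain` are hypothetical.

```python
import csv
import io

# Made-up miniature report; the column layout is an assumption about the
# TRUST4 report file, and none of these rows are real data.
SAMPLE_REPORT = "\n".join([
    "#count\tfrequency\tCDR3nt\tCDR3aa\tV\tD\tJ\tC\tcid\tcid_full_length",
    "5\t0.5\tTGTGCTTTC\tCAF\tTRAV21*01\t*\tTRAJ31*01\tTRAC\t0\t1",
    "1\t0.1\tTGTGCTGGG\tCAG\tTRAV12-1*01\t*\tTRAJ6*01\tTRAC\t1\t0",
    "1\t0.1\tTGTGCCTTT\tCAF\tTRAV8-3*01\t*\tTRAJ5*01\tTRAC\t2\t0",
    "3\t0.3\tTGTGCCAGC\tCAS\tTRBV7-2*01\tTRBD1*01\tTRBJ2-1*01\tTRBC2\t3\t1",
])


def load_rows(handle):
    """Read a tab-separated report into a list of dicts keyed by column name."""
    reader = csv.reader(handle, delimiter="\t")
    # Drop the leading '#' so the first column is simply 'count'.
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, fields)) for fields in reader if fields]


def filter_chain(rows, chain="TRA", min_count=1):
    """Keep rows whose V or J gene belongs to `chain` and whose count >= min_count."""
    kept = []
    for row in rows:
        on_chain = row["V"].startswith(chain) or row["J"].startswith(chain)
        if on_chain and float(row["count"]) >= min_count:
            kept.append(row)
    return kept


rows = load_rows(io.StringIO(SAMPLE_REPORT))
tra_rows = filter_chain(rows, "TRA")                       # all TRA entries
tra_non_singletons = filter_chain(rows, "TRA", min_count=2)  # drop count=1
print(len(rows), len(tra_rows), len(tra_non_singletons))
```

With a real report, `io.StringIO(SAMPLE_REPORT)` would be replaced by an open file handle for the report file.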
Then the counts seem quite different. See, for example, the top clonotype (TRAV-21 / TRAJ31) with a count of 361655 in MIXCR:
The same clonotype in the TRUST4 report, although still at the top, has a count of just 7290, roughly 50 times lower:
So there are a lot more clonotypes in the TRUST4 report compared to the MIXCR one, but I wanted to see which clonotypes found by MIXCR were not recovered by TRUST4.
What I observed is that most of these cases contain a CDR3 sequence with one or more gaps in MIXCR, which might be due to an out-of-frame CDR3. Each of these cases is a single line in the MIXCR output, but several lines in the TRUST4 one...
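As a rough way to check the out-of-frame hypothesis on the nucleotide sequences themselves, something like the sketch below could be used. It assumes only that the CDR3 nucleotide sequence is available as a plain string; the function names are hypothetical, and this is a basic frame check, not how either tool decides productivity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_out_of_frame(cdr3_nt: str) -> bool:
    """A CDR3 whose length is not a multiple of 3 cannot be read in frame."""
    return len(cdr3_nt) % 3 != 0


def has_stop_codon(cdr3_nt: str) -> bool:
    """Check the complete codons, read from the first base, for a stop codon."""
    seq = cdr3_nt.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return any(codon in STOP_CODONS for codon in codons)


# Toy sequences, not taken from the data discussed here.
print(is_out_of_frame("TGTGCTTTC"))   # 9 nt, length is a multiple of 3
print(is_out_of_frame("TGTGCTTTCA"))  # 10 nt, length is not a multiple of 3
print(has_stop_codon("TGTTGATTC"))    # second codon is TGA
```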
I extracted the "V" and "J" from these clonotypes with gaps in MIXCR, and subsetted both outputs for a few examples. Check the example below:
while the subset is just one line in the MIXCR output:
it becomes several different lines in the TRUST4 output:
Strangely, all these CDR3 sequences are quite different, and some of them aren't real ones (like FEASIRDENIIF above), which concerns me a bit. Most of these entries belong to singleton clonotypes, but not all (the top 4 lines have count>1).
I was wondering how to interpret this, and whether there is some aggregation or filtering that I should do downstream of TRUST4 to make the results more comprehensible (and comparable to the ones I previously obtained with MIXCR).
Many thanks again!