ImmunoSeq parser is treating out-of-frame sequence as in-frame
Hi, is the following ImmunoSeq-to-VDJtools conversion correct?
- I ran the command
vdjtools Convert -S ImmunoSeq -m metadata4Convert.txt ./samples/
- I am trying to understand how the ImmunoSeq parser is dealing with non-coding sequences. The top sequence in the first converted file is:
count freq cdr3nt cdr3aa v d j VEnd DStart DEnd JStart
2861 0.01262655239070375 TGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTG CASSHLSTEAISPSIL TRBV4-3 TRBD1-1 TRBJ1-5 13 21 24 27
The original sequence is actually out-of-frame (see below), but it appears to be translated fully by the vdjtools parser, without a stop (*) or frame-shift (not sure if this is ? or N) character. The "G" at the end of the cdr3nt sequence makes the cdr3nt length 48, but I believe it should not be included as part of the cdr3 sequence according to the original data.
- Here are the original lines from the ImmunoSeq file which correspond to the sequence above. The sequence is out-of-frame with a cdr3 length of 47 nt.
rearrangement amino_acid frame_type rearrangement_type templates reads frequency productive_frequency cdr3_length v_family v_gene v_allele d_family d_gene d_allele j_family j_gene
CACCCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT Out VDJ 2859 null 0.0126177257 null 47 TCRBV04 TCRBV04-03 1 TCRBD01 TCRBD01-01 1 TCRBJ01 TCRBJ01-05
ACACCCTGCAGCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT Out VDJ 1 null 4.41333533404535E-06 null 47 TCRBV04 TCRBV04-03 1 TCRBD01 TCRBD01-01 1 TCRBJ01 TCRBJ01-05
ACACCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT Out VDJ 1 null 4.41333533404535E-06 null 47 TCRBV04 TCRBV04-03 1 TCRBD01 TCRBD01-01 1 TCRBJ01 TCRBJ01-05
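The frame arithmetic above can be checked with a short script. This is only a sketch for illustration, not how VDJtools does it: the `translate` helper and the standard codon table are written from scratch here, and the 47 nt CDR3 is assumed to be the converted `cdr3nt` without its trailing "G".

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate complete codons only; leftover 1-2 nt are ignored."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


# cdr3nt as reported in the converted file (48 nt).
converted = "TGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTG"
# Assumption: the 47 nt CDR3 from the ImmunoSeq file is the same
# sequence without the final "G".
original = converted[:47]

for name, seq in (("converted", converted), ("original", original)):
    print(name, len(seq), "nt, remainder", len(seq) % 3, translate(seq))
```

The 48 nt string divides into 16 whole codons with no stop, which is why the converted record looks in-frame. The 47 nt string leaves a remainder of 2, consistent with the "Out" frame_type in the original file.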
When I run CalcBasicStats, is this sequence going to be defined as coding or non-coding? Thank you.
Right, it should be marked as non-canonical and dropped: it is out-of-frame even though the converted CDR3 translates without a stop codon or frame-shift character.
Translating such sequences was a deliberate choice, meant to check whether a sequence is really out-of-frame rather than just an incomplete read (I think some of the other issues here discuss this). I'll try to fix this.
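As a rough sketch of the intended behaviour (not the actual VDJtools parser code), a record could be classified from the columns ImmunoSeq already provides. The `Row` class and `is_coding` function are hypothetical names, and "In" as the in-frame value of frame_type is an assumption, since only "Out" appears in the rows quoted above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    frame_type: str   # "Out" is seen in the data above; "In" is assumed
    cdr3_length: int  # CDR3 length in nucleotides
    templates: int    # template count for this rearrangement


def is_coding(row):
    # Trust the upstream frame annotation first, then require that the
    # CDR3 consists of a whole number of codons.
    return row.frame_type == "In" and row.cdr3_length % 3 == 0


# The three original rows quoted in this issue.
rows = [Row("Out", 47, 2859), Row("Out", 47, 1), Row("Out", 47, 1)]

coding_count = sum(r.templates for r in rows if is_coding(r))
noncoding_count = sum(r.templates for r in rows if not is_coding(r))
print(coding_count, noncoding_count)
```

Under these assumptions all 2861 templates, which matches the count of the converted record, would end up as non-coding.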