
ImmunoSeq parser is treating out-of-frame sequence as in-frame

Open malisas opened this issue 8 years ago • 1 comments

Hi, is the following ImmunoSeq-to-VDJtools conversion correct?

  1. I ran the command
vdjtools Convert -S ImmunoSeq -m metadata4Convert.txt ./samples/
  2. I am trying to understand how the ImmunoSeq parser is dealing with non-coding sequences. The top sequence in the first converted file is:
count	freq	cdr3nt	cdr3aa	v	d	j	VEnd	DStart	DEnd	JStart
2861	0.01262655239070375	TGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTG	CASSHLSTEAISPSIL	TRBV4-3	TRBD1-1	TRBJ1-5	13	21	24	27

The original sequence is actually out-of-frame (see below), but it appears to be translated fully by the vdjtools parser, without a stop (*) or frame-shift (not sure if this is ? or N) character. The "G" at the end of the cdr3nt sequence makes the cdr3nt length 48, but I believe it should not be included as part of the cdr3 sequence according to the original data.

  3. Here are the original lines from the ImmunoSeq file which correspond to the sequence above. The sequence is out-of-frame with a cdr3 length of 47 nt.
rearrangement	amino_acid	frame_type	rearrangement_type	templates	reads	frequency	productive_frequency	cdr3_length	v_family	v_gene	v_allele	d_family	d_gene	d_allele	j_family	j_gene
CACCCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT		Out	VDJ	2859	null	0.0126177257	null	47	TCRBV04	TCRBV04-03	1	TCRBD01	TCRBD01-01	1	TCRBJ01	TCRBJ01-05
ACACCCTGCAGCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT		Out	VDJ	1	null	4.41333533404535E-06	null	47	TCRBV04	TCRBV04-03	1	TCRBD01	TCRBD01-01	1	TCRBJ01	TCRBJ01-05
ACACCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTGGTGAT		Out	VDJ	1	null	4.41333533404535E-06	null	47	TCRBV04	TCRBV04-03	1	TCRBD01	TCRBD01-01	1	TCRBJ01	TCRBJ01-05
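
The discrepancy can be checked by hand with a small script. The following is a minimal Python sketch, not the vdjtools parser: the helper names `translate` and `is_in_frame` are hypothetical, and the only inputs are values copied from the rows above. It assumes the standard genetic code and that "in-frame" simply means a CDR3 length divisible by 3.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate complete codons from the 5' end.

    Returns (protein, leftover), where leftover is the number of
    trailing nucleotides that do not form a complete codon.
    (Hypothetical helper, not part of vdjtools.)
    """
    leftover = len(nt) % 3
    usable = len(nt) - leftover
    protein = "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))
    return protein, leftover


def is_in_frame(cdr3_length):
    """Assumption: a CDR3 is in-frame when its length is a multiple of 3."""
    return cdr3_length % 3 == 0


# Values copied from the converted file and the ImmunoSeq rows above.
converted_cdr3nt = "TGCGCCAGCAGCCACCTCTCCACAGAGGCAATCAGCCCCAGCATTTTG"
immunoseq_cdr3_length = 47
immunoseq_templates = [2859, 1, 1]

# The converted 48-nt CDR3 translates cleanly, with no stop codon.
print(len(converted_cdr3nt), translate(converted_cdr3nt))
# -> 48 ('CASSHLSTEAISPSIL', 0)

# Dropping the trailing G gives 47 nt, which leaves an incomplete codon.
print(translate(converted_cdr3nt[:-1]))
# -> ('CASSHLSTEAISPSI', 2)

# The ImmunoSeq cdr3_length of 47 is not a multiple of 3.
print(is_in_frame(immunoseq_cdr3_length))
# -> False

# The templates of the three rows add up to the converted count.
print(sum(immunoseq_templates))
# -> 2861
```

The converted count of 2861 equals the sum of the three `templates` values, which is consistent with the three rows being merged into a single clonotype during conversion, although that is an inference from the numbers rather than something confirmed here.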

When I run CalcBasicStats, is this sequence going to be defined as coding or non-coding? Thank you.

malisas, Nov 16 '17 00:11

Right, it should be marked as non-canonical and dropped (it is out-of-frame, even though it translates perfectly).

The option to translate sequences was added on purpose, to check whether some sequences are really out-of-frame and not just incomplete reads (I think some of the other issues here discuss this). I'll try to fix this.

mikessh, Nov 16 '17 06:11