
'report.csv' file as input to VDJtools error

Open wangjiangyuan opened this issue 3 years ago • 15 comments

Hi, I analyzed a simple dataset and got the result file 'work_report.csv'. When I then ran VDJtools on it, I got an error. I think the 'out_of_frame' entries cause this error. Is there a good way to handle it? Thank you, Jiangyuan

$VDJTOOLS CalcBasicStats -m metadata.txt out/0
Executing com.antigenomics.vdjtools.basic.CalcBasicStats -m metadata.txt out/0
[Wed Jun 09 18:18:07 CST 2021 CalcBasicStats] Reading sample(s)
[Wed Jun 09 18:18:07 CST 2021 CalcBasicStats] 126 sample(s) prepared
[Wed Jun 09 18:18:07 CST 2021 SampleStreamConnection] Loading sample N001
[WARNING] Some of the essential fields are bad/missing for the following clonotype string (displaying first 5 warnings)
NO_J: 3 1.000000e+00 TGCACCACCATGCCAACTAATTTTTTTAAAAATTTTTTT CTTMPTNFFKNFF TRAV8-1*01 . . . assemble2674 0
[ERROR] java.lang.RuntimeException: Unable to parse clonotype string
2 1.000000e+00 TGCGCCTAGCCCCACCTTGCTGTTTTTT out_of_frame TRBV5-1*01 . TRBJ2-2*01 TRBC assemble14924 0
for VDJtools input type: Unknown symbol "o", see _vdjtools_error.log for details

wangjiangyuan avatar Jun 09 '21 13:06 wangjiangyuan

I think the easiest way is to run "grep -v out_of_frame trust_report.tsv > filtered_report.tsv" first and then run VDJtools.

I will check how VDJtools handles those out-of-frame recombinations.
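
The grep filter can also be written as a small Python function that matches only on the CDR3aa field rather than anywhere in the line. This is a minimal sketch, assuming the report is tab-separated with CDR3aa in the fourth column (as the clonotype strings in the error log suggest); the function name `drop_out_of_frame` is hypothetical and not part of TRUST4.

```python
def drop_out_of_frame(lines, cdr3aa_col=3):
    """Yield report lines whose CDR3aa field is not 'out_of_frame'.

    Assumption: columns are tab-separated and CDR3aa is the 4th column.
    Header/comment lines starting with '#' are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > cdr3aa_col and fields[cdr3aa_col] == "out_of_frame":
            continue  # skip out-of-frame clonotypes
        yield line
```

It could be used as `out.writelines(drop_out_of_frame(open("trust_report.tsv")))`, with `out` being the opened filtered file. Note that the log above also shows a warning for a clonotype without a J gene, which this filter does not remove.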

mourisl avatar Jun 09 '21 17:06 mourisl

By the way, we have a script "trust-stats.py" in the repo that can calculate commonly used diversity statistics.

mourisl avatar Jun 09 '21 18:06 mourisl

Thank you. "trust-stats.py" is a wonderful tool, and the script fully meets my needs. I would also like to know how to get tcrLibsize and bcrLibsize, because I want to compare the fraction of TCR α, β, γ, δ CDR3s between the tumor and normal groups, like this: image

wangjiangyuan avatar Jun 15 '21 12:06 wangjiangyuan

One way for that is to sum up the first column (count) in the report file. Another way is to use the results from trust-stats.py, and use the formula 1000*Richness / CPK to compute the number of reads for those chains. I will add a column for the abundance (e.g. library size or # of reads) in the trust-stats.py later.
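
Both approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report is tab-separated with count in column 1 and V, J, C genes in columns 5, 7, 8 (as in the clonotype strings in the log above), and the chain is taken from the first three letters of the first known gene. The function names are hypothetical.

```python
def chain_counts(report_lines):
    """Sum the count column (column 1) per chain, e.g. 'TRA', 'TRB', 'IGH'.

    Assumption: the chain is the first 3 letters of the first V/J/C gene
    that is not a placeholder ('.' or '*').
    """
    totals = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        genes = [g for g in (f[4], f[6], f[7]) if g not in (".", "*")]
        if not genes:
            continue
        chain = genes[0][:3]
        totals[chain] = totals.get(chain, 0.0) + float(f[0])
    return totals


def count_from_cpk(richness, cpk):
    """Recover the read count from 1000 * Richness / CPK."""
    return 1000.0 * richness / cpk
```

For example, a chain with Richness 50 and CPK 100 would correspond to 500 reads by the second formula.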

mourisl avatar Jun 15 '21 18:06 mourisl

I added code to trust-stats.py to calculate the counts. However, I don't know how to get the library size; I don't even know what it means. image image

wangjiangyuan avatar Jun 16 '21 01:06 wangjiangyuan

By the way, I want to know the CDR3 length distribution of all tumor / normal samples. Is it okay to add the counts directly, or do they need to be standardized? image

wangjiangyuan avatar Jun 16 '21 03:06 wangjiangyuan

How did you generate the table that includes tcrLibSize? We can trace back how those numbers were generated.

For the CDR3 length, if you apply standardization, the mean length would be at 0, so it will not be helpful to compare tumor/normal. You can use count, or the fraction of count within a particular sample to take the abundance into account.
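
The "fraction of count within a sample" option could look like the following sketch. It assumes a tab-separated report with count in column 1 and the CDR3 nucleotide sequence in column 3; the function name is hypothetical.

```python
from collections import defaultdict


def cdr3_length_fractions(report_lines, cdr3nt_col=2):
    """Return {CDR3 nucleotide length: fraction of the sample's total count}.

    Each clonotype contributes its count (column 1), so abundant
    clonotypes weigh more than rare ones.
    """
    by_len = defaultdict(float)
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        by_len[len(f[cdr3nt_col])] += float(f[0])
    total = sum(by_len.values())
    if not total:
        return {}
    return {n: c / total for n, c in sorted(by_len.items())}
```

Because the fractions sum to 1 within each sample, the per-sample distributions can be compared between tumor and normal samples regardless of sequencing depth.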

mourisl avatar Jun 16 '21 04:06 mourisl

Maybe I didn't make it clear. This table is reported in a paper, https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0681-3, and my goal is to produce a similar table. All the results in the paper are from TRUST3, so maybe I should ask the authors for help.

wangjiangyuan avatar Jun 16 '21 04:06 wangjiangyuan

I think the library size is the total number of reads falling into the IGH/TRB/... region. The count in TRUST4 is the number of reads from the CDR3 region. So both numbers could reflect the relative abundance. I have tested the correlation between TRUST3's libsize and TRUST4's CDR3 count before; if my memory is correct, they correlated well.

mourisl avatar Jun 16 '21 04:06 mourisl

Maybe I can understand it like this: "tcrCount/tcrLibsize*1000" in TRUST3 is equal to "TCR Count" in TRUST4. I calculated the two values and found that they are similar.

wangjiangyuan avatar Jun 16 '21 05:06 wangjiangyuan

I think tcrCount/tcrLibsize * 1000 should be the value of CPK. But I'm not sure in the table, whether the tcrCount means the number of unique TCRs or the number of reads used for TCR assembly.

mourisl avatar Jun 16 '21 05:06 mourisl

Actually, tcrCount means the number of reads used for TCR: 475+691+22+26=1214. image

wangjiangyuan avatar Jun 16 '21 06:06 wangjiangyuan

And I found that CPK in TRUST4 is Richness/Count, whereas you think that tcrCount/tcrLibsize * 1000 should be the value of CPK. The two CPKs may have different meanings. I am very confused.
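
To make the difference concrete, the two quantities can be written side by side. This is a sketch only: it reads CPK as 1000 * Richness / Count (consistent with the formula 1000*Richness / CPK given earlier for the read count) and treats tcrCount as a read count. The function names are hypothetical.

```python
def trust4_cpk(richness, count):
    """Clonotypes per kilo-reads: 1000 * Richness / Count.

    The numerator is the number of distinct clonotypes.
    """
    return 1000.0 * richness / count


def reads_per_kilo_libsize(tcr_count, tcr_libsize):
    """tcrCount / tcrLibsize * 1000, with tcrCount taken as a read count.

    The numerator is a number of reads, not a number of clonotypes.
    """
    return 1000.0 * tcr_count / tcr_libsize
```

The first value measures diversity relative to depth, the second measures CDR3 read abundance relative to the library size, so they need not agree even for the same sample.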

wangjiangyuan avatar Jun 16 '21 08:06 wangjiangyuan

I think you are right: the last column of TRUST3's txt output is the number of reads for the contigs containing CDR3. So this is roughly the CDR3 read count from TRUST4, though I'm not sure why the adjustment by /tcrLibsize * 1000 would equal TRUST4's count. It could be that those TRUST3 contigs include a lot of V or C gene content, so the adjustment with tcrLibSize reduces that effect.

mourisl avatar Jun 21 '21 07:06 mourisl

I also want to compare Richness and Count between two samples (preferably a tumor and a normal sample) and find CDR3s that appear specifically in tumor tissue or are more highly expressed in tumor.
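
One possible starting point for such a comparison is sketched below. It assumes tab-separated reports with count in column 1 and CDR3aa in column 4, normalizes counts to within-sample fractions, and uses an arbitrary 2x threshold for "more expressed"; it is not a statistical test, and the function names are hypothetical.

```python
def cdr3_fractions(report_lines, cdr3aa_col=3):
    """Map each in-frame CDR3aa to its fraction of the sample's total count."""
    counts = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        aa = f[cdr3aa_col]
        if aa == "out_of_frame":
            continue
        counts[aa] = counts.get(aa, 0.0) + float(f[0])
    total = sum(counts.values())
    if not total:
        return {}
    return {aa: c / total for aa, c in counts.items()}


def tumor_enriched(tumor, normal, min_ratio=2.0):
    """Return (tumor_only, enriched) from two {CDR3aa: fraction} dicts.

    tumor_only: CDR3s absent from the normal sample.
    enriched:   shared CDR3s whose tumor fraction is at least
                min_ratio times the normal fraction (arbitrary cutoff).
    """
    tumor_only = sorted(aa for aa in tumor if aa not in normal)
    enriched = sorted(
        aa for aa in tumor
        if aa in normal and tumor[aa] >= min_ratio * normal[aa]
    )
    return tumor_only, enriched
```

With many samples per group, a per-CDR3 test across samples would be more appropriate than a pairwise ratio.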

wangjiangyuan avatar Jul 13 '21 14:07 wangjiangyuan