TRUST4 icon indicating copy to clipboard operation
TRUST4 copied to clipboard

10X whitelist issue

Open Nusob888 opened this issue 2 years ago • 36 comments

Hi,

Sorry for spamming the issues. I am having an error come up whenever I try to add a whitelist.

essentially the run report gives me no additional explanation other than: failed: 11 at run-trust4 line 48

I have tried to look over the script but can't really identify why it should fail there.

If I remove the barcodeWhitelist parameter, it carries on as normal. I have also triple checked the whitelists and can match them to the dataset fine.

Is it possible to check if this is a reproducible error? or something specific to my run?

It occurs on all datasets I have tried it on.

Thanks

Nusob888 avatar Aug 28 '21 17:08 Nusob888

Could you please show me the full command you used? Thanks.

mourisl avatar Aug 28 '21 17:08 mourisl

Hi,

sure thing, this was for chemistry v3, but I have tried it for V1 and V2, same thing happens. Each time I adjusted the ranges to fit the chemistries

` run-trust4 -t 24 -f ~/hg38_bcrtcr.fa --ref ~/human_IMGT+C.fa -u ~/XXXXX_2 --barcode ~/XXXXX_1, --barcodeRange 0 15 + --barcodeWhitelist ~/3M-february-2018.txt --UMI ~/XXXXX_1 --umiRange 16 27 + --od ~/outdir/

`

Nusob888 avatar Aug 28 '21 18:08 Nusob888

Could you please show me the full on-screen output from TRUST4? I need to make sure that the crash happened in the fastq-extractor program. Thanks.

mourisl avatar Aug 28 '21 18:08 mourisl

Sorry, our cluster locks out copy and pasting the output and I have to keep the study IDs of the samples confidential. But yes it occurs just as the fastq-extractor starts. It then prints out "system" followed by the input commands as above. Then followed by failed: 11 at run-trust4 line 48.

Is that enough? if not I can email you a screen capture

Nusob888 avatar Aug 28 '21 18:08 Nusob888

Just want to make sure, the 3M-february-2018.txt file is compressed in the cellranger package, so have you decompressed it before running TRUST4?

Did you see the line " Start to extract candidate reads from read files." output on the screen?

mourisl avatar Aug 28 '21 18:08 mourisl

Hi, yes I took it from a local installation of cellranger and uploaded it on our cluster within a custom directory that I then supply the absolute path to.

The reason was because our cell ranger is a module installation that cannot be edited by end-users.

But the file is the decompressed txt.

What I see is:

[Sat Aug 28 19:42:51 2021] Start to extract candidate reads from read files. system ~/TRUST4/fastq-extractor -t 24 -f ~/hg38_bcrtcr.fa -o ~/out_dir//TRUST_XXXXX_2_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/3M-february-2018.txt --umiStart 16 --umiEnd 27 -u ~/XXXX_2.fastq.gz --barcode ~/XXXX_1.fastq.gz --UMI ~/XXXX_1.fastq.gz failed: 11 at ~/TRUST/run-trust4 line 48

Nusob888 avatar Aug 28 '21 18:08 Nusob888

This is very strange. Could you please show me the first few lines for XXX_1.fastq.gz, XXX_2.fastq.gz and 3M-february-2018.txt files?

mourisl avatar Aug 28 '21 19:08 mourisl

it is just the absolute paths to the locations:

for our system its /groupdir/users/username/projects/projectx/gex/XXX_1.fastq.gz /groupdir/users/username/projects/projectx/gex/XXX_2.fastq.gz /groupdir/users/username/reference/cellranger_whitelists/v3/3M-february-2018.txt

I would be surprised if that is an issue as trust4 runs to completion when I do not include the whitelist. It will extract the barcodes and umis and reads fine and generate the outputs.

It seems to be something to do with the whitelist matching and correction. The only thing might be if you call directly on any shared libraries or /bin located files, we do not have access to those. Or if under the hood you are calling on cell ranger? but I didn't see that in the script.

Nusob888 avatar Aug 28 '21 19:08 Nusob888

Sorry, I mean the first few lines of the content in those files, so I can check whether the lengths, especially in the whitelist file, match the files on our server. Thanks.

mourisl avatar Aug 28 '21 19:08 mourisl

Oh right. sorry.

R1 @XXXXX.1 1 length=28 CNAAGCGAGCTTACGACGCCTTTCATGC +XXXXX.1 1 length=28 F#F,FF,FFFFFFF:,FFFFFFFFFFFF @XXXXX.2 2 length=28 AATGCCAAGAACGCGTAGACATCTAGCT +XXXXX.2 2 length=28 FFFFFFFFFFFFFFFFFFFFFFFFFFFF @XXXXX.3 3 length=28 CCTCACACACAACCGCTTCTGGTGCAGA

R2 @XXXXX.1 1 length=151 GAAGAATGGTACAAATCCAAGTTTGCTGACCTCTCTGAGGCTGCCAACCGGAACAATGACGCCCTGCGCCAGGCAAAGCAGGAGTCCACTGAGTACCGGAGACAGGTGCAGTACCTCACCTGTGAAGTGGATGCCCTTAAAGGAACCAAAA +XXXXX.1 1 length=151 FFFFFFF:FFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFF:F:FFFF:FFFFFFFFF::FFFFFFFFFFF,FFFFFFFFFFFFFFFFFF,FFFFF:FFFFFFFFFFF,FFFFFF:FFFFFFFFFF,FFF:FFFFFFFF:,,FF:F: @XXXXX.2 2 length=151 AGCCACTGCGCCCAGCCGGTCTTACCCACTTTTATATATATTTGAAACACAGTGGAGTAGATACCTAATAAATATTTGTTAGATTTATCTGAAGAAACTTAGATTAAATTCACTCGGAAGGTGCCTTATGTCAGGGATACTAAAAAATTAT +XXXXX.2 2 length=151 FFFFFFFFF,FFFFFFF,FFFFF:FFFFF::FFFFFFF:F:FFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FF,FF,FFFFFFF:FFFFFFFFFFF,FFFFFFF:FF,FF,,FFFFF,F,:,F::,F:FFFF @XXXXX.3 3 length=151 GCCTGGTGGCTCACACCTGTAATCCTAGCACTTTAGGAGGCCAAGGGGGGTGGATCACGAGGTCAGGGGTTCGAGACCAGCCTGACCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGCGTGGTGGCAGGCGCCT

whitelist: AAACCCAAGAAACACT AAACCCAAGAAACCAT AAACCCAAGAAACCCA AAACCCAAGAAACCCG AAACCCAAGAAACCTG AAACCCAAGAAACGAA AAACCCAAGAAACGTC AAACCCAAGAAACTAC AAACCCAAGAAACTCA AAACCCAAGAAACTGC AAACCCAAGAAACTGT AAACCCAAGAAAGAAC AAACCCAAGAAAGACA AAACCCAAGAAAGCCT AAACCCAAGAAAGCGA AAACCCAAGAAAGGAT AAACCCAAGAAAGGTA AAACCCAAGAAAGTCT AAACCCAAGAAATAGG AAACCCAAGAAATCCA AAACCCAAGAAATGAG

Nusob888 avatar Aug 28 '21 19:08 Nusob888

Sorry, can I just add a request onto the same issue chain.

Is it possible to index the annot.fa with the report.tsv or cdr3.out?

or Add read counts into the annot.fa names?

I am trying to re-annotate the annot.fa assembly contigs with Igblast but there is no way to index back to the read counts. Since annot.fa is a consensus contig, the read fragment count is lost.

This would be really helpful as a feature since I predominantly use Immcantation as my analysis workflow to get mutation counts etc. But in its current state, I cannot do it in the same way as I can with MIXCR. In MIXCR indexing is possible as they provide the "targetsequence" (consensus assembly) in their report file.

Thanks very much for the consideration.

Nusob888 avatar Aug 30 '21 01:08 Nusob888

The first column in the cdr3.out file is the contig id in the annot.fa file, which can be used for index. Or do you mean other way of indexing? The report.tsv file is the summary of cdr3.out file, and the second last column is the representative's contig id in the annot.fa file.

For the whitelist issue, it finished without any error on our data even when testing with different settings. I may need some more time to test. In our previous experiment, the barcode correction marginally improved the results.

mourisl avatar Aug 30 '21 01:08 mourisl

Thanks for the update on the whitelist issue and ongoing troubleshooting.

For the indexing, what I have noticed, is that there are multiple duplicated assemble ids. e.g. assemble0 occurs up to 20 times for me.

When I look closer, this is the case in the report also, and it seems this is down to the fact that trust4 allows for SMH, and therefore multiple cids can be assigned to different sequence contigs.

The cdr3. out file nicely contains a consensus_id index, but this is not present in the annot.fa and therefore we cannot index the exact sequence in the annot.fa file to the report or cdr3.out file. Also the cdr3.out file oddly has read counts that look like averages? 76.08 for example, whereas the report has round numericals

Nusob888 avatar Aug 30 '21 02:08 Nusob888

The sequence contig in the annot.fa file is the consensus and encodes/compresses highly similar sequences, such as from SHMs as you mentioned. Therefore, in the _cdr3.out file, TRUST4 also outputs those minor CDR3s encoded in the consensus. In other words, the CDR3s are all from the sequence in annot.fa, and they are listed by the second column "index_within_consensus" in cdr3.out. I think for the CDR3s from the same contig/consensus, you can pick the one with highest abundance as representative.

As for the decimal numbers of the abundance, it is because that when TRUST4 tries to decode the CDR3 encoded in the consensus, a read partially overlapped with CDR3 region could support more than 1 CDR3 type. Therefore, TRUST4 applied EM algorithm to quantify the CDR3s from the same consensus.

Hope these explanations help.

mourisl avatar Aug 30 '21 02:08 mourisl

Thanks, that makes sense.

However, I remain a little confused how trust4 works in this case:

  • Is the consensus assembly then the most abundant assembled contig from a group of similar receptor sequences?
  • Why does trust4 not output a fasta of all assembled contigs? If you group sequences under a single consensus, are you not losing information about somatic hypermutation per recovered contig? For BCR analysis this is an important aspect when looking at clonal lineages. Many prefer to use custom clone definitions of varying stringency.

Lastly, I would prefer to index all annot.fa contigs with those in the cdr3.out or report.tsv files. The trouble with picking a highly abundant one is that the nucleotide sequences are not identical, therefore for diversity analysis, you are losing information if you just pick by consensus or abundance for use in other tools.

This seems like it could be easily resolved by adding a index_within_consensus ID into the annot.fa file?

Nusob888 avatar Aug 30 '21 02:08 Nusob888

Yes, you can regard the consensus assembly being the most abundant clonotype. The receptor sequence is much longer than a read, for example, it is difficult to decide whether two variations in V and J genes respectively are from the same receptor sequence. So TRUST4 only tries to recover the encoded sequence in CDR3, which in many cases can be fully covered by one read.

I think I can write a simple script to break up consensuses in annot.fa file with information from _cdr3.out file, so they have a one-to-one mapping.

mourisl avatar Aug 30 '21 02:08 mourisl

That would be great. Thank you.

Alternatively, if the fully assembled contig could be added to the report.tsv and cdr3.out file, that could be good as a user can then generate a fasta file directly from the contig.

re: how trust4 recovers sequences: Does this mean that trust4 does not actually assemble a full contig per B cell receptor variation?

In a B cell receptor I would consider any variations of V, junction, J to be potentially relevant and warrants assembly into a full contig so long as the read fragments were of good quality and the overlaps with high confidence.

This is how Bracer and MIXCR handle their contigs.

I understand the logic from the methods section of trust4, that from a computational efficiency perspective the consensus system is faster. However, from an analysis side, grouping to a single assembly loses a lot of information and granularity on how diverse an individual clonal family is. Many BCR analysis specifically look at the phylogeny of individual clones of interest to understand clonal selection/affinity maturation.

Is there a possibility to generate variations of a consensus sequence by knitting reads of variance into their own individual contigs. e.g. if a read has 2 mutations in CDR1 region, but otherwise aligns exactly to the consensus, then annot.fa will add an additional contig that is the consensus but deviates in those 2 nucleotides in CDR1?

Nusob888 avatar Aug 30 '21 02:08 Nusob888

If there are enough variations, TRUST4 will break them up into different contigs. The full contigs are for the highly expressed receptors. I think Bracer is for single-cell assemblies. In the single-cell setting, TRUST4 will not mix the variations from different cells into the same consensus. As for the last part, if a read has 2 mutations in CDR1, but there could be mutations in CDR2 or CDR3 from other reads, and it is very difficult to knit them together in bulk RNA-seq setting.

mourisl avatar Aug 30 '21 03:08 mourisl

Thanks for explaining, that makes sense now.

In that case the only desired addition, if possible, would be to break up annot.fa to match cdr3.out and report.tsv or to have consensus full length sequences as an additional column in the report.tsv. That would help a lot with indexing back to the read counts for all within consensus ids

Nusob888 avatar Aug 30 '21 09:08 Nusob888

I just add a perl script AddSequenceToCDR3File.pl to the scripts folder in the github repo. You can run it as: perl AddSequenceToCDR3File trust_cdr3.out trust_annot.fa > combined_cdr3.out It will add the assembled contig sequence to each row in the cdr3 file and also change the CDR3 sequence in the contig accordingly.

Please let me know whether it works.

mourisl avatar Aug 31 '21 01:08 mourisl

Hello, Sorry I left a comment here and hope didn't interrupt the discussion. I met the same error with whitelist. From my questions before I knew I need barcode whitelist correction. Now when I try to process it, I met same issue. Thus, I left a comment here. Next is my code and output message in terminal:

[000003 ~]$ ~/software/TRUST4/run-trust4 -t 8 -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz --barcodeRange 0 15 + --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -f ~/software/TRUST4/hg38_bcrtcr.fa --ref ~/software/TRUST4/human_IMGT+C.fa -o ~/project/sc/output/trust4/gex_list/ [Tue Aug 31 10:36:49 2021] TRUST4 begins. [Tue Aug 31 10:36:49 2021] SYSTEM CALL: ~/software/TRUST4/fastq-extractor -t 8 -f ~/TRUST4/hg38_bcrtcr.fa -o ~/project/sc/output/trust4/gex_list/_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz [Tue Aug 31 10:36:50 2021] Start to extract candidate reads from read files. system ~/software/TRUST4/fastq-extractor -t 8 -f ~/software/TRUST4/hg38_bcrtcr.fa -o ~/project/sc/output/trust4/gex_list/_toassemble --barcodeStart 0 --barcodeEnd 15 --barcodeWhitelist ~/project/sc/data/737K-august-2016.txt -u ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R2_001.fastq.gz --barcode ~/project/sc/data/fastq/vdj_nextgem_hs_pbmc3_5gex_protein_fastqs/vdj_nextgem_hs_pbmc3_5gex_protein_gex_fastqs/R1_001.fastq.gz failed: 139 at ~/software/TRUST4/run-trust4 line 48.

Here I changed all the home directory to '~' to delete my names in it. Also sorry for the long folder names of data.

I noticed one thing that in README the example code used --barcodeWhiteList and the parameter is --barcodeWhitelist with lower case in L of List. Will this make any differences?

Please tell me more about these. Thank you.

cwxiix avatar Aug 31 '21 02:08 cwxiix

That is a typo in README. It should be --barcodeWhitelist. I will fix that right away. Thank you!

mourisl avatar Aug 31 '21 03:08 mourisl

Hi @mourisl , this works a treat! thank you so much for that!

I will keep the thread open for @cwxiix and the whitelist issue, but this solves the indexing

Nusob888 avatar Sep 01 '21 14:09 Nusob888

I tried several different data sets on our server and could not reproduce this whitelist error. What system were you using? Was it MacOS?

mourisl avatar Sep 07 '21 01:09 mourisl

For me, I am using a Xshell terminal to connect to a server and the system is CentOS Linux release 7.5.1804 (Core) . Is this information helpful?

cwxiix avatar Sep 07 '21 02:09 cwxiix

Our cluster structure is a little complex, but each node runs a version of linux. Just to sanity check, are there any hardcoded parts in the script that require the whitelist to be in a cellranger library? because I stored the whitelists in a project specific directory. I had a look and it didn't appear to be the case, but thought I would check.

Nusob888 avatar Sep 07 '21 15:09 Nusob888

Hi,

I have a same problem with 'Nusob888'. The running code is here.

run-trust4 -f /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/hg38_bcrtcr.fa \
--ref /mnt/S3/data2/workbench/Users/ktkim/ref/tenX/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
-u /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R2_*.fastq.gz \
--barcode /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R1_*.fastq.gz \
--barcodeRange 0 15  \
--barcodeWhitelist /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/translation/3M-february-2018.txt

and error is like this.

failed: 11 at /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/run-trust4 line 48.

Those fasta files came from 10X, and some infomation is here.

R1 :

@A00930:41:HWGVCDSXX:3:1101:5104:1000 1:N:0:AAGCAGTC
NCAATCTTCACGAAGGCCCAGCCGCA
+
#FFFFFFFFFFFFFFFFFFFFFFFFF

R2 :

@A00930:41:HWGVCDSXX:3:1101:5104:1000 2:N:0:AAGCAGTC
GACATCCGAGGATGGATTGAAGGTAGGATAGGGGCTCACCGCTGATCCGGGACCACCTTTGGATGACTTCACAGTTTGAACATATTCCTGCTCTTCAT
+
FFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFF,FFFFFFFFFF:FFFFFFFFFFFFFFFFFFF

whitelist came from cellranger :

AAACCCAAGAAACACT        AAACCCATCAAACACT
AAACCCAAGAAACCAT        AAACCCATCAAACCAT
AAACCCAAGAAACCCA        AAACCCATCAAACCCA
AAACCCAAGAAACCCG        AAACCCATCAAACCCG
AAACCCAAGAAACCTG        AAACCCATCAAACCTG
AAACCCAAGAAACGAA        AAACCCATCAAACGAA
AAACCCAAGAAACGTC        AAACCCATCAAACGTC
AAACCCAAGAAACTAC        AAACCCATCAAACTAC
AAACCCAAGAAACTCA        AAACCCATCAAACTCA
AAACCCAAGAAACTGC        AAACCCATCAAACTGC

Thanks to read.

ttab963 avatar Sep 12 '21 08:09 ttab963

@ttab963 I think the reason for your case might be different. There are two issues with your command:

  1. --barcodeRange option should also have a third value to specify the strand, so it should be "--barcodeRange 0 15 +"
  2. This issue does not relate to the barcode. The --ref option is not for the reference genome, it is for the IMGT reference, so it should be "--ref /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/human_IMGT+C.fa".

mourisl avatar Sep 12 '21 16:09 mourisl

Thanks for your reply. I changed the code, but same error occurs.

nohup run-trust4 -f /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/hg38_bcrtcr.fa \
--ref /mnt/S3/data2/workbench/Users/ktkim/program/TRUST4/human_IMGT+C.fa \
-u /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R2_*.fastq.gz \
--barcode /data/rawdata/scHCC/TBD190769_20200514/01_fastq_file/46_N/*_R1_*.fastq.gz \
--barcodeRange 0 15 + \
--barcodeWhitelist /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/translation/3M-february-2018.txt &

ttab963 avatar Sep 13 '21 00:09 ttab963

@ttab963 Your whitelist is from the cellranger translation folder. It contains two columns maybe for the translation of multiome data. TRUST4 assumes each row contains one barcode, so you should use the barcode in the path /mnt/S3/data2/workbench/Users/ktkim/program/cellranger4.0.0/lib/python/cellranger/barcodes/ instead of that in the translation folder. (You still need to unzip the whitelist there).

mourisl avatar Sep 13 '21 01:09 mourisl