CITE-seq-Count
CITE-SEQ-COUNT 100% unmapped
Hi, thanks so much for developing this great tool. I just ran 10x Genomics scRNA sequencing on a pool of hashtag-multiplexed samples, and I tried the command below to generate the cell-hashtag count matrix.
CITE-seq-Count -R1 HTO-R1.fastq.gz -R2 HTO-R2.fastq.gz -t abTags.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -wl barcodes.txt -o result -cells 19210 -n10000 --debug
But I got the warning below. "Read1 length is 28bp but you are using 26bp for Cell and UMI barcodes combined. This might lead to wrong cell attribution and skewed umi counts."
The run_report.yaml is also below. Can you please advise what is wrong? I would appreciate any suggestions.
Running time: 1.005 seconds
CITE-seq-Count Version: 1.4.2
Reads processed: 10000
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 0
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 0
UMI collapsing threshold: 2
UMIs corrected: 11
Run parameters:
Read1_filename: HTO-R1.fastq.gz
Read2_filename: HTO-R2.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 26
Expected cells: 19210
Tags max errors: 2
Start trim: 0
Thanks a lot!
Hello @ZeyanZhang,
there is a lot that could go wrong.
The first step is simply to grep one tag in R2 and see if you get any hits. Also check where it hits: at the start of the read, or in the middle?
Can you do this and paste here some of those results?
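For anyone who prefers a tally over eyeballing grep output, here is a minimal Python sketch of the same check. It counts the position at which a tag first appears in each R2 read. The function name `tag_offsets` and the `max_reads` parameter are made up for illustration; nothing here is part of CITE-seq-Count.

```python
import gzip
from collections import Counter
from itertools import islice


def tag_offsets(fastq_path, tag, max_reads=10000):
    """Count the 0-based position at which `tag` first appears in each read.

    Only exact matches are counted, like a plain grep would do.
    """
    offsets = Counter()
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        # In a FASTQ file the sequence is the 2nd line of every 4-line record.
        for seq in islice(handle, 1, max_reads * 4, 4):
            pos = seq.strip().find(tag)
            if pos >= 0:
                offsets[pos] += 1
    return offsets
```

Calling something like `tag_offsets("HTO-R2.fastq.gz", one_tag_sequence).most_common(5)` would list the most frequent positions. A single dominant position of 0 means the tag sits at the start of the read; a dominant position greater than 0, or several positions, means there are bases in front of it.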
Hello @Hoohm ,
Thanks so much! I do see a lot of hits with grep. They are in the middle of the read and followed by poly-A; please see some of the results below.
Thanks again!
Great. This shows that you need to use the --sliding-window option, and I'd recommend adding --start-trim. The value of the start trim should be around the mean number of bases in the green part before your tags. I can't really count it from the image, but you can get the number from your greps.
Try adding those options and let me know how it goes.
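A rough way to get that number from reads you have already grepped is sketched below in Python. It assumes you have a handful of R2 sequences as plain strings; `bases_before_tag` and `suggest_start_trim` are hypothetical helper names, not CITE-seq-Count functions.

```python
from statistics import mean


def bases_before_tag(reads, tag):
    """For each read that contains `tag`, return the number of bases before it."""
    return [read.find(tag) for read in reads if tag in read]


def suggest_start_trim(reads, tag):
    """Summarise how many bases precede the tag across the given reads.

    Returns None when the tag is not found in any read.
    """
    before = bases_before_tag(reads, tag)
    if not before:
        return None
    # The mean follows the advice above; the minimum is reported too, because
    # trimming more than the minimum would cut into the tag in some reads.
    return {"mean": round(mean(before)), "min": min(before)}
```

If the mean and the minimum differ, the smaller value is the more cautious choice, since the sliding window can still skip over a few leftover bases.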
Thanks @Hoohm, I adjusted the command as you suggested (see below), and I changed -umil from 26 to 28 because R1 is 28 bp long and the 10x v3 cell barcode plus UMI is 28 bp. The result is still 100% unmapped. I have also pasted a couple of debug log entries below. Can you see any other problems? Thanks!
CITE-seq-Count -R1 HTO-R1.fastq.gz -R2 HTO-R2.fastq.gz -t abTags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -wl barcodes.txt --sliding-window --start-trim 45 -o result -cells 19210 --debug
line:GAATAGAAGGAACTATACTGCTGCCGTTNAGCTTCGTGCCTTCTCTCATCTCCCCGACTGAAACTGCTCTTGTTTGAAGGCACGTGACTATCGAAGATGCTGGCGTCAGGAGACTTTAG cell_barcode:GAATAGAAGGAACTAT UMI:b'ACTGCTGCCGTT' TAG_seq:TTGAAGGCACGTGACTATCGAAGATGCTGGCGTCAGGAGACTTTAG line length:119 cell barcode length:16 UMI length:12 TAG sequence length:46 Best match is: unmapped
line:TGAATCGGTGAGCTCCTCCATGTTTTACNCTCATATCTGCCTTTGATGTGTGAAAGACACTCCCAGCTGGAGGAGAGTACAAGAAAGATCTAAAATATTTGCTTCAGCTGCAAAGAGCT cell_barcode:TGAATCGGTGAGCTCC UMI:b'TCCATGTTTTAC' TAG_seq:AGAGTACAAGAAAGATCTAAAATATTTGCTTCAGCTGCAAAGAGCT line length:119 cell barcode length:16 UMI length:12 TAG sequence length:46 Best match is: unmapped
If I grep for one tag in the debug log, I do see best matches, so it is strange that the result is still 100% unmapped.
Best match is: EGFR-GCTTAACATTGGCAC
line:TCATGCCTCGAGAGACGGTTAGAGGACCAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:TCATGCCTCGAGAGAC UMI:b'GGTTAGAGGACC' TAG_seq:AGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
line:TCAGGTAGTTGCTCAACAGTTTTTGTGAAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:TCAGGTAGTTGCTCAA UMI:b'CAGTTTTTGTGA' TAG_seq:AGCTTAACATTGGCACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
line:CTCCAACAGCGGTAACTTACTCTCGAGGAAGCAGTGGTATCAACGCAGAGTACATGGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:CTCCAACAGCGGTAAC UMI:b'TTACTCTCGAGG' TAG_seq:CAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
line:TTCGCTGCAATCTCGACATAGATTTTAAAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAATTAAA
cell_barcode:TTCGCTGCAATCTCGA UMI:b'CATAGATTTTAA' TAG_seq:AGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAATTAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
Hello @ZeyanZhang, sorry I kind of forgot to come back to you. The logs you're showing me should provide some mapped content. Have you tried the dev branch?
@Hoohm I have the same issue here. We changed our 10x protocol from TotalSeq A to TotalSeq C, and now we have around 10 cycles before our tag in R2. I tried to use --sliding-window as you mentioned, but it still does not work. Here is the command:
CITE-seq-Count -R1 BEI11908_BM-Bcell_620_L001_R1_001.fastq.gz -R2 BEI11908_BM-Bcell_620_L001_R2_001.fastq.gz -t TSC_AbTag.csv -cbf 1 -cbl 16 -umif 17 -umil 26 --sliding-window -cells 5000 --max-error 3 -o B620
Do I need to specify anything after the --sliding-window option?
Here is the error:
Finding a whitelist
/anaconda3/lib/python3.7/site-packages/umi_tools/whitelist_methods.py:283: RuntimeWarning: invalid value encountered in sqrt
lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
Traceback (most recent call last):
File "/anaconda3/bin/CITE-seq-Count", line 11, in
I'm using the newest version of CITE-seq-Count.
@ZeyanZhang did you solve the problem?
@YunZheHuang Would you mind sending me a sample of the data so that I can look at it?
@Hoohm
Hi, thanks a lot for making this cool software.
I have a problem analyzing my cell hashing samples. I have 10x data and ran CITE-seq-Count to pre-process the raw reads. I have multiple samples, and the result is always 100% unmapped tags. Here is a report:
Date: 2020-03-18
Running time: 59.0 minutes, 37.49 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 21737257
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 2
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 86382
UMI collapsing threshold: 2
UMIs corrected: 3419837
Run parameters:
Read1_paths: SJ-2366-Adam-barcode-15_S60_L001_R1_001.fastq.gz
Read2_paths: SJ-2366-Adam-barcode-15_S60_L001_R2_001.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 26
Expected cells: 5000
Tags max errors: 2
Start trim: 0
In my case the tags are not spread over different positions in the R2 reads; they are always in the same place.
Could you please help me to solve it?
Thanks a lot
@GrigoriiNos, the documentation has a section about special cases. You need to use --start-trim with a value of 10.
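To illustrate why a fixed offset in front of the tag leads to 100% unmapped without a start trim, here is a small conceptual Python sketch. It is not CITE-seq-Count's actual matching code: it uses a plain Hamming distance and a hypothetical `match_tag` function, and the read is synthetic. The two tag sequences are borrowed from a tag list posted in this thread.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_tag(read2, tags, start_trim=0, max_errors=2):
    """Compare the bases right after `start_trim` with every tag.

    `tags` maps tag sequence -> tag name. Returns the best tag name,
    or "unmapped" if no tag is within `max_errors` mismatches.
    """
    best_name, best_dist = "unmapped", max_errors + 1
    for seq, name in tags.items():
        candidate = read2[start_trim:start_trim + len(seq)]
        if len(candidate) < len(seq):
            continue  # read too short to hold this tag after trimming
        dist = hamming(candidate, seq)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


tags = {"ACCCACCAGTAAGAC": "HTO1", "GGTCGAGAGCATTCA": "HTO2"}
read2 = "N" * 10 + "ACCCACCAGTAAGAC" + "AAAA"  # synthetic read, 10 bases before the tag

print(match_tag(read2, tags, start_trim=0))   # compares the wrong window
print(match_tag(read2, tags, start_trim=10))  # compares the tag itself
```

With no trim, the window compared against the tags is shifted by 10 bases, so every tag is far beyond the allowed number of errors even though the tag is present in the read.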
Hi!
Thanks for developing this great tool! I have a very similar problem. We sequenced 5' scRNA-seq with Biolegend TotalSeq C hash tags and I get 100% unmapped. However, the hashtags are in the R2 read.
Here is my command:
CITE-seq-Count -R1 HTO_index_S6_combined_R1.fastq.gz -R2 short_HTO_index_S6_combined_R2.fastq.gz -t list_HTO.csv -cbf 1 -cbl 16 -umif 17 -umil 26 --expected_cells 24000 --output HTO_test_combined
Here is the list with my barcodes:
ACCCACCAGTAAGAC,HTO1
GGTCGAGAGCATTCA,HTO2
CTTGCCGCATGTCAT,HTO3
AAAGCATTCTTCACG,HTO4
CTTTGTCTTTGTGAG,HTO5
When I look at the debug log file I can find many lines like this:
line:TACCCACAGGGAGGGTTGTGATCAAGTAACCCACCAGTAAGAC
cell_barcode:TACCCACAGGGAGGGT UMI:b'TGTGATCAAG' TAG_seq:ACCCACCAGTAAGAC
line length:43 cell barcode length:16 UMI length:10 TAG sequence length:15
Best match is: unmapped
However, the TAG sequence matches my HTO1 100%. Could you please advise what I am doing wrong?
Thanks, Verena
Hey @vlink
I think there is an issue with the indexes.
It seems like your sequences start at base 2 on R2.
Can you try this command:
zcat short_HTO_index_S6_combined_R2.fastq.gz | head -n 100 | grep ACCCACCAGTAAGAC
and let me know if my suspicions are correct?
Thanks for the quick reply.
I double checked and it does not seem to be the case.
Here are the first two lines of the grep output:
ACCCACCAGTAAGAC
ACCCACCAGTAAGAC
There is still something strange here:
TACCCACAGGGAGGGT TGTGATCAAGTAACCCACCAGTAAGAC
TACCCACAGGGAGGGT TGTGATCAAG ACCCACCAGTAAGAC
Where is this TA coming from?
I think you might need to change your -umil to 28, which gives a UMI length of 12.
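The arithmetic behind this can be checked with a few lines of Python. This sketch assumes the debug `line` is Read 1 followed by Read 2, and that Read 1 is therefore 28 bases here (43 total minus the 15-base tag); both are inferences from the log above, not something the tool states. `split_read1` is a hypothetical helper.

```python
def split_read1(read1, cbf, cbl, umif, umil):
    """Split Read 1 using 1-based, inclusive positions like -cbf/-cbl/-umif/-umil."""
    cell_barcode = read1[cbf - 1:cbl]
    umi = read1[umif - 1:umil]
    leftover = read1[umil:]  # bases of Read 1 that no option accounts for
    return cell_barcode, umi, leftover


line = "TACCCACAGGGAGGGTTGTGATCAAGTAACCCACCAGTAAGAC"  # from the debug log above
read1 = line[:28]  # assumption: Read 1 is 28 bases long

print(split_read1(read1, 1, 16, 17, 26))  # UMI of 10 bases, "TA" left over
print(split_read1(read1, 1, 16, 17, 28))  # UMI of 12 bases, nothing left over
```

With -umil 26 the two bases `TA` belong to neither the cell barcode nor the UMI, which is where the extra `TA` in the line comes from.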
Hi @Hoohm and everyone working on this issue:
I got the same warning at the beginning ([WARNING] Read1 length is 28bp but you are using 26bp for Cell and UMI barcodes combined. This might lead to wrong cell attribution and skewed umi counts), but the final report showed 94% mapping. I wonder whether I should be worried about this result because of the warning message.
Here is the report:
Date: 2021-02-06
Running time: 18.0 minutes, 2.204 seconds
CITE-seq-Count Version: 1.4.4
Reads processed: 25470272
Percentage mapped: 94
Percentage unmapped: 6
Uncorrected cells: 0
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 7544
UMI collapsing threshold: 2
UMIs corrected: 3638
Run parameters:
Read1_paths: /Users/yingzhengxu/Desktop/count_matrix/L004_R1.gz
Read2_paths: /Users/yingzhengxu/Desktop/count_matrix/L004_R2.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 26
Expected cells: 33734
Tags max errors: 2
Start trim: 0
The warning is only a warning. It mostly shows up because people sequence a bit more than they need to; in your case, it's probably a wrong input argument.
Usually the cell barcode spans positions 1 to 16 and the UMI spans 17 to 28 (a length of 12); at least, this is the default on recent 10x runs.
In your specific case, I would rerun the sample with adjusted values: change -umil from 26 to 28. You might see a very small increase in UMI counts.
The mapping rate is fine and has nothing to do with the warning you got.
Hope this helps
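As a quick sanity check before a run, the comparison behind the warning can be sketched in Python. This only mirrors the arithmetic in the warning text; `unused_read1_bases` is a hypothetical helper, not CITE-seq-Count's own check.

```python
def unused_read1_bases(read1_length, cbf, cbl, umif, umil):
    """Return how many Read 1 bases are not covered by the barcode and UMI ranges.

    Positions are 1-based and inclusive, like -cbf/-cbl/-umif/-umil.
    """
    barcode_length = cbl - cbf + 1
    umi_length = umil - umif + 1
    return read1_length - (barcode_length + umi_length)


print(unused_read1_bases(28, 1, 16, 17, 26))  # 26 bp used out of 28
print(unused_read1_bases(28, 1, 16, 17, 28))  # all 28 bp used
```

A result of 0 means the cell barcode and UMI together cover the whole of Read 1; a positive result is the situation the warning describes.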
I've been experiencing a similar issue as @ZeyanZhang, where a grep of log files indicates successful matches of tags, yet still results in 100% unmatched. Is there a suggested route of troubleshooting? Thank you in advance for any guidance.
CITE-seq-Count Version: 1.4.4
Reads processed: 100000
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 0
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 363
UMI collapsing threshold: 2
UMIs corrected: 68
Run parameters:
Read1_paths: Pool1_HTO/Pool1_HTO_S9_R1_001.fastq.gz
Read2_paths: Pool1_HTO/Pool1_HTO_S9_R2_001.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 28
Expected cells: 9600
Tags max errors: 2
Start trim: 45
Hi all,
I was having a similar issue where about 50% of cells were unmapped. I wondered whether that might be real. Any ideas why so many cells were unmapped? Bad antibody staining? Or is the command wrong?
Date: 2022-03-13
Running time: 2.0 hours, 20.0 minutes, 45.9 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 220187173
Percentage mapped: 53
Percentage unmapped: 47
Uncorrected cells: 64
Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 746524
UMI collapsing threshold: 2
UMIs corrected: 20304552
Run parameters:
Read1_paths: UE_FB_FL_S5_R1_001.fastq.gz
Read2_paths: UE_FB_FL_S5_R2_001.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 28
Expected cells: 39238
Tags max errors: 2
Start trim: 10