CITE-SEQ-COUNT 100% unmapped

Open ZeyanZhang opened this issue 5 years ago • 20 comments

Hi, thanks so much for developing this great tool. I just ran a 10x Genomics scRNA-seq experiment on a pool of hashtag-multiplexed samples. I tried to use the command below to generate the cell-hashtag count matrix:

CITE-seq-Count -R1 HTO-R1.fastq.gz -R2 HTO-R2.fastq.gz -t abTags.csv -cbf 1 -cbl 16 -umif 17 -umil 26 -wl barcodes.txt -o result -cells 19210 -n10000 --debug

But I got the warning below: "Read1 length is 28bp but you are using 26bp for Cell and UMI barcodes combined. This might lead to wrong cell attribution and skewed umi counts."

The run_report.yaml is also shown below. Can you please advise what is wrong? I would appreciate any suggestions.

Running time: 1.005 seconds
CITE-seq-Count Version: 1.4.2
Reads processed: 10000
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 0
Correction:
  Cell barcodes collapsing threshold: 1
  Cell barcodes corrected: 0
  UMI collapsing threshold: 2
  UMIs corrected: 11
Run parameters:
  Read1_filename: HTO-R1.fastq.gz
  Read2_filename: HTO-R2.fastq.gz
  Cell barcode:
    First position: 1
    Last position: 16
  UMI barcode:
    First position: 17
    Last position: 26
  Expected cells: 19210
  Tags max errors: 2
  Start trim: 0

Thanks a lot!

ZeyanZhang avatar Jun 06 '19 20:06 ZeyanZhang

Hello @ZeyanZhang,

there is a lot that could go wrong.

The first step to check what's happening is simply to grep for one TAG in R2 and see if you get any hits. Also check where it hits: at the start of the read, or in the middle?

Can you do this and paste some of the results here?
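
The same check can be scripted so that the hit positions are tallied rather than read off by eye. This is a minimal sketch, assuming a gzipped FASTQ with four lines per record; the function name `tag_offsets` and the `max_reads` parameter are illustrative and are not part of CITE-seq-Count.

```python
import gzip
from collections import Counter


def tag_offsets(fastq_gz_path, tag, max_reads=10000):
    """Count the 0-based positions at which `tag` starts in the first reads of a FASTQ file.

    Hypothetical helper for inspection only; it looks for exact matches.
    """
    offsets = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number // 4 >= max_reads:
                break
            if line_number % 4 != 1:
                # Only the 2nd line of each 4-line record holds the sequence.
                continue
            position = line.find(tag)
            if position != -1:
                offsets[position] += 1
    return offsets
```

Calling `tag_offsets("HTO-R2.fastq.gz", tag).most_common(5)`, with `tag` set to one sequence from abTags.csv, would list the most frequent start positions. A single dominant position suggests a fixed offset; several neighbouring positions suggest the tag location shifts from read to read.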

Hoohm avatar Jun 07 '19 09:06 Hoohm

Hello @Hoohm, thanks so much! I do see a lot of hits with grep; they are in the middle of the read and followed by poly-A stretches. Please see some of the results below.

[image]

Thanks again!

ZeyanZhang avatar Jun 07 '19 14:06 ZeyanZhang

Great. This shows that you need to use the --sliding-window option and I'd recommend adding --start-trim. The value of the start trim should be around the mean number of bases of the green part before your tags. I can't really count it from the image but you can get the number with your greps.

Try adding those options and let me know how it goes.
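
To illustrate why a start trim alone may not be enough, here is a rough sketch of a sliding-window search. It is not CITE-seq-Count's actual implementation; `hamming` and `find_tag` are hypothetical names. The two R2 sequences are rebuilt from the debug output posted further down in this thread (the part of each logged "line" after the 28 bp of R1), and the poly-A tail lengths are taken from the logged TAG_seq values.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(read2, tag, start_trim=0, max_errors=2):
    """Return (offset_after_trim, mismatches) for the best window, or None.

    Every window of len(tag) bases after the trim point is compared with the
    tag; the first window with the fewest mismatches (<= max_errors) wins.
    """
    trimmed = read2[start_trim:]
    best = None
    for offset in range(len(trimmed) - len(tag) + 1):
        errors = hamming(trimmed[offset:offset + len(tag)], tag)
        if errors <= max_errors and (best is None or errors < best[1]):
            best = (offset, errors)
    return best


TAG = "GCTTAACATTGGCAC"  # EGFR tag from the thread

# R2 = bases before the tag + tag + poly-A tail
READ2_A = "AAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCA" + TAG + "A" * 30
READ2_B = "AAGCAGTGGTATCAACGCAGAGTACATGGGGGCACCCGAGAATTCCA" + TAG + "A" * 29

print(find_tag(READ2_A, TAG, start_trim=45))  # (1, 0)
print(find_tag(READ2_B, TAG, start_trim=45))  # (2, 0)
```

In these two reads the tag starts one and two bases after a trim of 45, because the second read carries an extra G upstream. A comparison at one fixed position would therefore be shifted in at least one of them, which is the situation a sliding window is meant to handle.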

Hoohm avatar Jun 07 '19 15:06 Hoohm

Thanks @Hoohm, I adjusted the command as you suggested (see below), and I changed -umil from 26 to 28 because R1 is 28 bp long and the 10x v3 CB+UMI is 28 bp. The result is still 100% unmapped. I have also put a couple of lines of debug output below. Can you see any other problems? Thanks!

CITE-seq-Count -R1 HTO-R1.fastq.gz -R2 HTO-R2.fastq.gz -t abTags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -wl barcodes.txt --sliding-window --start-trim 45 -o result -cells 19210 --debug

line:GAATAGAAGGAACTATACTGCTGCCGTTNAGCTTCGTGCCTTCTCTCATCTCCCCGACTGAAACTGCTCTTGTTTGAAGGCACGTGACTATCGAAGATGCTGGCGTCAGGAGACTTTAG
cell_barcode:GAATAGAAGGAACTAT
UMI:b'ACTGCTGCCGTT'
TAG_seq:TTGAAGGCACGTGACTATCGAAGATGCTGGCGTCAGGAGACTTTAG
line length:119
cell barcode length:16
UMI length:12
TAG sequence length:46
Best match is: unmapped

line:TGAATCGGTGAGCTCCTCCATGTTTTACNCTCATATCTGCCTTTGATGTGTGAAAGACACTCCCAGCTGGAGGAGAGTACAAGAAAGATCTAAAATATTTGCTTCAGCTGCAAAGAGCT
cell_barcode:TGAATCGGTGAGCTCC
UMI:b'TCCATGTTTTAC'
TAG_seq:AGAGTACAAGAAAGATCTAAAATATTTGCTTCAGCTGCAAAGAGCT
line length:119
cell barcode length:16
UMI length:12
TAG sequence length:46
Best match is: unmapped

ZeyanZhang avatar Jun 07 '19 15:06 ZeyanZhang

And if I grep for one tag in the debug output, I do see best matches. It is strange that the result is still 100% unmapped.

Best match is: EGFR-GCTTAACATTGGCAC
line:TCATGCCTCGAGAGACGGTTAGAGGACCAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:TCATGCCTCGAGAGAC
UMI:b'GGTTAGAGGACC'
TAG_seq:AGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
line:TCAGGTAGTTGCTCAACAGTTTTTGTGAAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:TCAGGTAGTTGCTCAA
UMI:b'CAGTTTTTGTGA'
TAG_seq:AGCTTAACATTGGCACGAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
line:CTCCAACAGCGGTAACTTACTCTCGAGGAAGCAGTGGTATCAACGCAGAGTACATGGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
cell_barcode:CTCCAACAGCGGTAAC
UMI:b'TTACTCTCGAGG'
TAG_seq:CAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
line:TTCGCTGCAATCTCGACATAGATTTTAAAAGCAGTGGTATCAACGCAGAGTACATGGGGCACCCGAGAATTCCAGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAATTAAA
cell_barcode:TTCGCTGCAATCTCGA
UMI:b'CATAGATTTTAA'
TAG_seq:AGCTTAACATTGGCACAAAAAAAAAAAAAAAAAAAAAAAAATTAAA
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC
Best match is: EGFR-GCTTAACATTGGCAC

[image]

ZeyanZhang avatar Jun 10 '19 13:06 ZeyanZhang

Hello @ZeyanZhang, sorry I kind of forgot to come back to you. The logs you're showing me should provide some mapped content. Have you tried the dev branch?

Hoohm avatar Oct 01 '19 15:10 Hoohm

@Hoohm I have the same issue here. We changed our 10x protocol from TotalA to TotalC, and now we have around 10 cycles before our tag in R2. I tried to use --sliding-window as you mentioned, but it still does not work. Here is the command:

CITE-seq-Count -R1 BEI11908_BM-Bcell_620_L001_R1_001.fastq.gz -R2 BEI11908_BM-Bcell_620_L001_R2_001.fastq.gz -t TSC_AbTag.csv -cbf 1 -cbl 16 -umif 17 -umil 26 --sliding-window -cells 5000 --max-error 3 -o B620

Do I need to specify anything after the --sliding-window option?

Here is the error:

Finding a whitelist
/anaconda3/lib/python3.7/site-packages/umi_tools/whitelist_methods.py:283: RuntimeWarning: invalid value encountered in sqrt
  lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
Traceback (most recent call last):
  File "/anaconda3/bin/CITE-seq-Count", line 11, in <module>
    sys.exit(main())
  File "/anaconda3/lib/python3.7/site-packages/cite_seq_count/main.py", line 352, in main
    collapsing_threshold=args.bc_threshold)
  File "/anaconda3/lib/python3.7/site-packages/cite_seq_count/processing.py", line 310, in correct_cells
    plotfile_prefix=False)
  File "/anaconda3/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 447, in getCellWhitelist
    cell_barcode_counts, cell_number, plotfile_prefix)
  File "/anaconda3/lib/python3.7/site-packages/umi_tools/whitelist_methods.py", line 322, in getKneeEstimateDistance
    raise ValueError("Something's gone wrong here!!")
ValueError: Something's gone wrong here!!

I'm using the newest version of CITE-seq-Count.

YunZheHuang avatar Oct 23 '19 17:10 YunZheHuang

@ZeyanZhang did you solve the problem?

YunZheHuang avatar Oct 23 '19 17:10 YunZheHuang

@YunZheHuang Would you mind sending me a sample of the data so that I can look at it?

Hoohm avatar Oct 24 '19 11:10 Hoohm

@Hoohm

Hi, thanks a lot for making this cool software.

I have a problem analyzing my hashcoding samples. I have 10x data and I ran CITE-seq-Count to pre-process the raw reads. I have multiple samples and the result is always 100% unmapped tags. Here is a report:

Date: 2020-03-18
Running time: 59.0 minutes, 37.49 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 21737257
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 2
Correction:
  Cell barcodes collapsing threshold: 1
  Cell barcodes corrected: 86382
  UMI collapsing threshold: 2
  UMIs corrected: 3419837
Run parameters:
  Read1_paths: SJ-2366-Adam-barcode-15_S60_L001_R1_001.fastq.gz
  Read2_paths: SJ-2366-Adam-barcode-15_S60_L001_R2_001.fastq.gz
  Cell barcode:
    First position: 1
    Last position: 16
  UMI barcode:
    First position: 17
    Last position: 26
  Expected cells: 5000
  Tags max errors: 2
  Start trim: 0

In my case the tags are not spread over different positions in the R2 files; they are always in the same place.

Could you please help me to solve it?

Thanks a lot

[Screenshot 2020-03-19 at 12 31 50]

GrigoriiNos avatar Mar 19 '20 09:03 GrigoriiNos

@GrigoriiNos in the documentation there is a section about special cases.

You need to use --start-trim with a value of 10.

Hoohm avatar Mar 21 '20 16:03 Hoohm

Hi!

Thanks for developing this great tool! I have a very similar problem. We sequenced 5' scRNA-seq with Biolegend TotalSeq C hash tags and I get 100% unmapped. However, the hashtags are in the R2 read.

Here is my command:

CITE-seq-Count -R1 HTO_index_S6_combined_R1.fastq.gz -R2 short_HTO_index_S6_combined_R2.fastq.gz -t list_HTO.csv -cbf 1 -cbl 16 -umif 17 -umil 26 --expected_cells 24000 --output HTO_test_combined

Here is the list with my barcodes:

ACCCACCAGTAAGAC,HTO1
GGTCGAGAGCATTCA,HTO2
CTTGCCGCATGTCAT,HTO3
AAAGCATTCTTCACG,HTO4
CTTTGTCTTTGTGAG,HTO5

When I look at the debug log file I can find many entries like this:

line:TACCCACAGGGAGGGTTGTGATCAAGTAACCCACCAGTAAGAC
cell_barcode:TACCCACAGGGAGGGT
UMI:b'TGTGATCAAG'
TAG_seq:ACCCACCAGTAAGAC
line length:43
cell barcode length:16
UMI length:10
TAG sequence length:15
Best match is: unmapped

However, the TAG sequence matches my HTO1 100%. Could you please advise what I am doing wrong?

Thanks, Verena

vlink avatar Apr 08 '20 20:04 vlink

Hey @vlink, I think there is an issue with the indexes. It seems like your sequences start at base 2 on R2. Can you try this command:

zcat short_HTO_index_S6_combined_R2.fastq.gz | head -n 100 | grep ACCCACCAGTAAGAC

and let me know if my suspicions are correct?

Hoohm avatar Apr 09 '20 06:04 Hoohm

Thanks for the quick reply.

I double checked and it does not seem to be the case.

Here are the first two lines of the grep output:

ACCCACCAGTAAGAC
ACCCACCAGTAAGAC

vlink avatar Apr 09 '20 12:04 vlink

There is still something strange here:

TACCCACAGGGAGGGT TGTGATCAAGTAACCCACCAGTAAGAC
TACCCACAGGGAGGGT TGTGATCAAG  ACCCACCAGTAAGAC

Where is this TA coming from?

Hoohm avatar Apr 18 '20 09:04 Hoohm

I think you might need to change your UMI length to 12, i.e. set -umil to 28.

Hoohm avatar Apr 18 '20 09:04 Hoohm

Hi @Hoohm and everyone working on this issue:

I got the same warning at the beginning ([WARNING] Read1 length is 28bp but you are using 26bp for Cell and UMI barcodes combined. This might lead to wrong cell attribution and skewed umi counts), but the final report showed 94% mapping. I wonder whether this result is worrisome because of the warning message.

Here is the report:

Date: 2021-02-06
Running time: 18.0 minutes, 2.204 seconds
CITE-seq-Count Version: 1.4.4
Reads processed: 25470272
Percentage mapped: 94
Percentage unmapped: 6
Uncorrected cells: 0
Correction:
  Cell barcodes collapsing threshold: 1
  Cell barcodes corrected: 7544
  UMI collapsing threshold: 2
  UMIs corrected: 3638
Run parameters:
  Read1_paths: /Users/yingzhengxu/Desktop/count_matrix/L004_R1.gz
  Read2_paths: /Users/yingzhengxu/Desktop/count_matrix/L004_R2.gz
  Cell barcode:
    First position: 1
    Last position: 16
  UMI barcode:
    First position: 17
    Last position: 26
  Expected cells: 33734
  Tags max errors: 2
  Start trim: 0

YingzhengXu avatar Feb 06 '21 23:02 YingzhengXu

The warning is only a warning. It mostly comes up because people sequence a bit deeper than they need to; in your case, it's probably a wrong input argument.

Usually, the cell barcode spans positions 1 to 16 and the UMI spans positions 17 to 28 (length of 12); at least this is the default on recent 10x runs.

In your specific case, I would rerun the sample with the adjusted values (change 26 to 28). You might see a very small increase in UMI counts.
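
The effect of 26 versus 28 can be shown with plain string slicing. This is only a sketch of the position arithmetic, assuming 1-based inclusive positions as the UMI lengths in the debug output of this thread suggest; `split_read1` is a hypothetical helper, not a CITE-seq-Count function. The read is the 28 bp R1 from the first debug line posted above.

```python
def split_read1(read1, cbf=1, cbl=16, umif=17, umil=28):
    """Split R1 into (cell_barcode, umi, leftover) using 1-based inclusive positions."""
    cell_barcode = read1[cbf - 1:cbl]
    umi = read1[umif - 1:umil]
    leftover = read1[umil:]  # bases of R1 that are used for neither barcode nor UMI
    return cell_barcode, umi, leftover


read1 = "GAATAGAAGGAACTATACTGCTGCCGTT"  # 16 bp cell barcode + 12 bp UMI

print(split_read1(read1, umil=26))  # ('GAATAGAAGGAACTAT', 'ACTGCTGCCG', 'TT')
print(split_read1(read1, umil=28))  # ('GAATAGAAGGAACTAT', 'ACTGCTGCCGTT', '')
```

With -umil 26 the last two UMI bases are ignored, so UMIs that differ only in those two bases look identical; that is the "skewed umi counts" the warning refers to.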

The mapping rate is fine and has nothing to do with the warning you got.

Hope this helps

Hoohm avatar Feb 07 '21 15:02 Hoohm

I've been experiencing a similar issue to @ZeyanZhang's: a grep of the log files shows successful tag matches, yet the run still reports 100% unmapped. Is there a suggested route for troubleshooting? Thank you in advance for any guidance.

CITE-seq-Count Version: 1.4.4
Reads processed: 100000
Percentage mapped: 0
Percentage unmapped: 100
Uncorrected cells: 0
Correction:
  Cell barcodes collapsing threshold: 1
  Cell barcodes corrected: 363
  UMI collapsing threshold: 2
  UMIs corrected: 68
Run parameters:
  Read1_paths: Pool1_HTO/Pool1_HTO_S9_R1_001.fastq.gz
  Read2_paths: Pool1_HTO/Pool1_HTO_S9_R2_001.fastq.gz
  Cell barcode:
    First position: 1
    Last position: 16
  UMI barcode:
    First position: 17
    Last position: 28
  Expected cells: 9600
  Tags max errors: 2
  Start trim: 45

[image]

[image]

carrowjk avatar Aug 04 '21 03:08 carrowjk

Hi all,

I was having a similar issue where about 50% of the reads were unmapped. I thought that might be really. Any ideas why so many were unmapped? Bad antibody staining? Or is the command wrong?

Date: 2022-03-13
Running time: 2.0 hours, 20.0 minutes, 45.9 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 220187173
Percentage mapped: 53
Percentage unmapped: 47
Uncorrected cells: 64
Correction:
  Cell barcodes collapsing threshold: 1
  Cell barcodes corrected: 746524
  UMI collapsing threshold: 2
  UMIs corrected: 20304552
Run parameters:
  Read1_paths: UE_FB_FL_S5_R1_001.fastq.gz
  Read2_paths: UE_FB_FL_S5_R2_001.fastq.gz
  Cell barcode:
    First position: 1
    Last position: 16
  UMI barcode:
    First position: 17
    Last position: 28
  Expected cells: 39238
  Tags max errors: 2
  Start trim: 10

YingzhengXu avatar Mar 13 '22 22:03 YingzhengXu