CITE-seq-Count
CITE-seq-Count copied to clipboard
What is the difference between read_count and uni_count?
Hey, first of all thank you so much for the incredibly useful and well documented CITE-seq-count tool. I was going through the documentation and also tried to figure out from the code what the difference is between the folders:
read_count and umi_count
I read that both are corrected for PCR duplicates? Any insight would be very helpful. Thanks heaps!
Hello @cpflueger2016
Maybe I should write a bit more details about that part. UMI counts are "demultiplexed" reads. Meaning that you get one count per combination of cell + UMI + TAG regardless of how reads of the same combination you get.
Read counts are not collapsed/demultiplexed. So for 1 UMI you might have 10 reads. Usually we use the umi count matrix for analysis.
I provide both because sometimes you want to have an idea of the complexity of your data or maybe protocol troubleshooting.
UMI counts output

Read counts output

Hey @Hoohm, thank you so much for your reply and the detailed explanation. I was thinking that this what it meant but really wanted to double-check. In fact, I do see a strange/unexpected difference between UMI_counts and Read_counts. I have attached both tables with the same cell barcodes below. Essentially, when looking at read_counts, I see the enrichment of tags (barcodes) I would expect. However, when looking at the UMI counts, it appears much noisier. Given that the RT should have added the UMI and cell barcode at the same step, I was wondering if you guys have seen such a "late" amplification of tags/barcodes from Antibodies that I would more likely classify as noise. I know that github is for bioinformatics analysis but I thought I ask if you have seen that before ;)
In any case, thanks again for this tremendous tool and that you have included the read_counts for troubleshooting. 👍
Hey @cpflueger2016 sorry for the late response.
It is a bit strange that you have some tags in some cells that are more amplified. I'm noticing this on tag_6 for cells 14,15,16 which don't behave the same at all.
Does it affect at all the downstream analysis?
Hey @Hoohm, thanks for the followup.
As you have pointed out, I do expect a much stronger amplification of tags in certain cells. These "barcodes" are driven by a promoter to distinguish between the origin of lentiviral cargo. In my analysis, the count table before UMI collapse resembles other readouts that I have also captured most closely.
For example, row 16, anything other than tag_6 is noise. Similarly, row 11 has the highest counts for tag_1 and tag_3 and the rest is noise.
However, that would be almost impossible to conclude from the UMI counts output, even though the highest counts in that row do match. It's just so strange and I cannot quite explain what is going on. I take you haven't seen something like this either.
I guess there is something in the protocol that behaves differently than expected. I haven't seen something like it yet no, sorry.
Would you be able to describe the protocol quickly to maybe have an idea of what's happening?
Sure thing. The tags (barcodes) are expressed in a cell after lentiviral transduction. They have a 5' PCR handle to attach the P7 handle by PCR and on the 3' the tags are polyA tailed. They gain the cell barcode and the UMI in the 10x chromium-based single cell library preparation during the RT (reverse transcription) reaction.
It does look to me that the "wrong" assignment of the tags to the cell barcodes could be due to sequencing error or error tolerance in the kmer analysis. I did run your program with 0 mismatches for UMIs and 0 mismatches for cell barcodes though.
Is there a debug output that I could check which read_id (in the fastq file) the tag assignments to the cell barcode are coming from?
Yes there is a --debug
flag which will show you each match result and the proposed match.
This is not kmer based, it's a simple fuzzy grep so I'm confident this is not the issue.
Having more strict parameters should definitely take are of this kind of errors.
What does fastqc say about the data?
Hey @Hoohm, I did a bit of deeper digging and found that the CITE-seq tool is working perfectly fine. The issue I'm seeing is related to late-stage PCR based barcode swapping.
TL;DR Let me walk you through it maybe someone else will find a similar issue and it can be helpful. In the tables above, I went into line 2 that contains the cell barcode
ATTCTATATCCTGATAGCTA
Tag_4 should be the only one that should be present in that cell as evidenced by the high read_counts
prior to UMI correction.
I then went ahead and extracted all the fastq reads that contain that cell barcode
zcat exp1_R1.fastq.gz | grep -B 1 'AAACCCAAGATTGTGA' > cell_barcode_AAACCCAAGATTGTGA.txt
followed by grepping the read identifiers
grep '\@' cell_barcode_AAACCCAAGATTGTGA.txt | perl -pe 's/\s.*//' > read2_accessions_cell_barcode_AAACCCAAGATTGTGA.txt
I then subsetted all the reads that contain tag_5
which is not supposed to be present in that cell.
grep -B 1 'ATTCTATATCCTGATAGCTA' read2_sequence_matched_bc.txt | grep '\@' | perl -pe 's/\s.*//' > tag5_barcode_ids_from_read2.txt
I used those fastq read IDs to go back and extract the cell_barcode + UMI from read1
zcat exp1_R1.fastq.gz | grep -A 1 -f tag5_barcode_ids_from_read2.txt > tag5_read1_cell_barcode_AAACCCAAGATTGTGA.txt
I then checked the unique UMIs that are present for tag_5
to get a sense if that is similar to the output that was reported by CITE-seq tools.
perl -ne '/^[^@-]/ && print $_' tag5_read1_cell_barcode_AAACCCAAGATTGTGA.txt| perl -ne 'my $s = substr $_, -13; print $s' | sort | uniq -c
1 AAAGCGGGCCTT 1 ACATTGGACCTT 3 ACTAGAGTCGGG 2 AGATCCAGCATC 2 ATGAGTATCACG 5 CCACAGGCTCTG 3 CCATCATTTTTT 1 CCGTAAAGTTCC 1 CCTCTTAATTCA 2 CCTGTTATTGGG 1 CGGTGACATAAA 1 CGTATTTCCGCT 1 CTCCGGGTGGAG 1 CTTGCCATTTGG 1 GGAAGAATAGGT 1 GGCGGCCATCTC 2 GGGAAATTGGCT 1 GGTGTTCGACAG 2 GTCCTGGGGAAT 2 TAATACAGTGAT 3 TGCCCAGGGTCA 1 TGCCTCGAGGAG 1 TTGGTATATTTA
23 unique entries
The 23 unique entries are quite similar to the 18 unique entries that were the output from CITE-seq tool

I picked 1 UMI TTGGTATATTTA
and extracted the same cell barcode + UMI combo from read1
grep -B1 'AAACCCAAGATTGTGACCACAGGCTCTG' read1_UMI_CCACAGGCTCTG.txt | grep '\@' | perl -pe 's/\s.*//' > read_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt
followed by extracting the matching read2 information
zcat exp_R2.fastq.gz | grep -A 1 -f read_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt > read2_identicalUMI_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt
Finally, I went back and checked what tags
are associated with the same cell barcode + UMI combo...
perl -ne '/^[^@-]/ && print $_' read2_identicalUMI_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt | grep -f tags_DNA_sequence.txt | perl -ne 'my $s = substr $_ , 29, 20; print $s ."\n"' | sort | uniq -c
1 ATCCAAAGTAGGGACTACCT 5 ATTCTATATCCTGATAGCTA 1 CATCCAAAGTAGGGACTACA 3 CATCCAAAGTAGGGACTACC 80 GCATCCAAAGTAGGGACTAC 3 TCAAGAGCTCCCAAGAGTTT
And there I have it, all the tags
I have used come up with 80 counts for tag_4
which is the actual tag present and the rest get the same UMI and cell barcode in later stages of the PCR as I cannot imagine the same BC+UMI combo to be present on all tags
in any given cell.
Now the question is, for any given cell, is there an option to assign the dominant (most counts) UMI+BC combo as the "correct" one and filter all late-stage PCR barcode swaps out? That step would have to be done prior to UMI collapse.
That's some nice digging you did there :)
There is no way to do what you need with CITE-seq-Count especially since the dict that collects counts is based on the cell-barcode as the first key, so your UMI/TAG combos will be distributed everywhere.
But maybe I'm misunderstanding this. Would you like to take 15min to chat about this on discord?
Hey @Hoohm, thanks heaps for your reply. Would love to have a chat with you. I just signed up for discord and my number is cpflueger#4139. Feel free to send me an invite. I am in Australia so would be good to chat sometime in the afternoon/evening when it is your morning / mid-day :)
Hey @cpflueger2016 had a few minutes to spend on this. the switch_branch now works as we discussed.
Whenever there is more than one most_common, It prints out the top 10 most common and their respective count.
The line looks like:
GTTGTAGGTCGGAA,ACTCCTCGCC,tag1-3/tag2-1
Let me know how it goes.
Hey @Hoohm, I have given the switch_mod
branch a go and saved the output to a text file. However, I do not see the output on the command line as you have indicated. Did you commit your changes to the switch_mod
to the repo?
Here is my command that I have run:
nice -n 12 ~/work3/conda_installs/stiletto/miniconda2/envs/cite_seq_switchMOD/bin/CITE-seq-Count \ --cell_barcode_first_base 1 \ --cell_barcode_last_base 16 \ --umi_first_base 17 \ --umi_last_base 28 \ --expected_cells 9000 \ --max-error 4 \ --start-trim 26 \ --sliding-window \ --dense \ --threads 12 \ --no_umi_correction \ --unmapped-tags unmapped.cvs \ --unknown-top-tags 200 \ -R1 RL1953_2019_12_18_pA_BC_R1.fastq.gz \ -R2 RL1953_2019_12_18_pA_BC_R2.fastq.gz \ -t BEP_barcode_tags.csv \ -o pA_barcodes_6xBEP_whitelist_SW_switchMOD_260402020 \ --whitelist whitelist_BEP6x_filtered_cells.txt > output_matrix.txt
Have you updated the code? https://github.com/KirstieJane/STEMMRoleModels/wiki/Syncing-your-fork-to-the-original-repository-via-the-browser
Hey @Hoohm,
I successfully updated the branch with your help and ran the CST from the switch_mod branch. But there seems to be an error:
Traceback (most recent call last): File "/home/cpfluger/work2/bin/cite_seq_count/__main__.py", line 543, in <module> main() File "/home/cpfluger/work2/bin/cite_seq_count/__main__.py", line 464, in main print('{},{},{}'.format(cell_barcode, UMI.decode(), '/'.join('-'.join([i[0], str(i[1])]) for i in most_common))) AttributeError: 'str' object has no attribute 'decode'
Any idea what that could be?
This is strange.
Can you double check that you call it like : CITE-seq-Count -R1 R1.fastq.gz -R2 R2.fastq.gz --tags tags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 10 -T 1 -o results_switch -n 1000000 --mode most_common_tag --no_umi_correction --dense
Hey @Hoohm,
you're a legend mate! First off, I created a separate conda environment on our server with python3 installed. Then I had to install the branch on our server like that:
pip install git+https://github.com/Hoohm/CITE-seq-Count.git@switch_mod
Others might know that but maybe that is useful info for someone down the line.
Then I ran your pipeline and boy oh boy did it work. Look at just a small screenshot below of how many tags contain the "same" UMI.

The question is how difficult would it be to select only the tags with the highest count per UMI that are not unmapped and then output them to a dense matrix.
In any case, this is huge!
Glad to see that it works :) The unmapped question is a real one actually. Should we just select the top 1, discarding unmapped if it is or get the top 1 TAG + the unmapped? I'll deactivate the STDOUT output, this was just for looking at the data, to check the behavior
On Sun, 3 May 2020 at 10:49, cpflueger2016 [email protected] wrote:
Hey @Hoohm https://github.com/Hoohm,
you're a legend mate! First off, I created a separate conda environment on our server with python3 installed. Then I had to install the branch on our server like that:
pip install git+https://github.com/Hoohm/CITE-seq-Count.git@switch_mod
Others might know that but maybe that is useful for info for someone later on.
Then I ran your pipeline and boy oh boy did it work. Look at just a small screenshot below of how many tags contain the "same" UMI.
[image: tags_with_same_UMI] https://user-images.githubusercontent.com/17895758/80909792-920caa80-8d5d-11ea-9bd8-f2be4a421a4e.png
The question is how difficult would it be to select only the tags with the highest count per UMI that are not unmapped and then output them to a dense matrix.
In any case, this is huge!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/CITE-seq-Count/issues/105#issuecomment-623076578, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2GELI2B62G6XKSJLZTRPUVYFANCNFSM4KKRFYXQ .
--
Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland
https://github.com/Hoohm https://github.com/Hoohm
If you could keep the STDOUT output as a "debug" option that would be huge since it is really helpful to understand what is going on regarding cellbarcode/UMI swaps.
And yes, option 1: discard unmapped and keep the top 1 TAG.
Grand stuff Patrick!
I'll take a look later this week. I'm also gonna merge your branch with the latest develop to profit from the latest features.
Hey, were are we on this. Could you run your samples or do you still need something on top of what we did?
Hey @Hoohm, sorry for the late reply. Your implementation has done magic. I'll compile some data and post it here shortly. Essentially, removing the UMI+CB swaps cleans up the data and massively improves the data analysis.
That's great! Happy to hear it
Hi @Hoohm, here is the long-promised update on the switch_mod branch.
I have used the switch_mod branch to parse out the UMI/CB swaps. I parsed it with an R script as I'm not too good with python coding. Just a reminder, this is the head of the barcodes for 6 tags BEFORE running it through your switch_mod branch.

When I use that data to cluster the tags, this is the output I'm getting.
As you can see, each cell has almost all tags present which we already established is nonsense and due to the late-stage PCR amplification artefact.
After running the switch_mod branch and filtering with an R script for:
- The tag with the highest UMI counts is correct and needs to be kept
- If unmapped tags are the highest count, the next highest correct tag will be kept
- Same goes for equal counts in rank 1 and rank 2 of a valid tag and unmapped tag -> highest valid tag is kept
- If the first two ranks have equal counts of valid tags, it discards that UMI/CB combination
The results look like this:

And clustering of that data looks like that
One can clearly see how much cleaner the data looks like. So thanks heaps for implementing that switch_mod branch. The question is, if that filtering that I'm performing with an R script, can be directly implemented in CITE-seq tools?
Here is the R script that I wrote (pls don't judge too hard - I'm still pretty new to this).
read in the CST output matrix
pA_UMI_CST <- read.csv("switch_mod_STDOUT_UMI_bc.txt", header = FALSE, sep = ",", col.names = c("cell_BC", "UMI", "tag_counts"))
split the different "tag_counts" fields into the respective effectors
pA_UMI_CST %>%
separate(tag_counts, c(paste0("tag_",seq(1,8))), sep = "/" ) -> pA_UMI_CST_sep
split the actual counts off of the column tag_1 and tag_2
pA_UMI_CST_sep %>%
separate(tag_1, c("tag_1.POI", "tag_1.counts"), "-") -> pA_UMI_CST_sep
pA_UMI_CST_sep$tag_1.counts <- as.numeric(pA_UMI_CST_sep$tag_1.counts)
pA_UMI_CST_sep %>%
separate(tag_2, c("tag_2.POI", "tag_2.counts"), "-") -> pA_UMI_CST_sep
pA_UMI_CST_sep$tag_2.counts <- as.numeric(pA_UMI_CST_sep$tag_2.counts)
decide on the final tag (based on highest counts) and drop the others
pA_UMI_CST_sep$tag_final.POI <- NA
pA_UMI_CST_sep$tag_final.counts <- as.numeric(NA)
Set the POI to the second tag IF the first tag is unmapped
unmapped_first <- pA_UMI_CST_sep$tag_1.POI == "unmapped"
pA_UMI_CST_sep$tag_final.POI[unmapped_first] <- pA_UMI_CST_sep$tag_2.ePOI[unmapped_first]
pA_UMI_CST_sep$tag_final.counts[unmapped_first] <- pA_UMI_CST_sep$tag_2.counts[unmapped_first]
Set the final protein of interest (POI) to the first tag if it has the highest counts
highest_counts <- pA_UMI_CST_sep$tag_1.counts > pA_UMI_CST_sep$tag_2.counts
pA_UMI_CST_sep$tag_final.POI[highest_counts] <- pA_UMI_CST_sep$tag_1.POI[highest_counts]
pA_UMI_CST_sep$tag_final.counts[highest_counts] <- pA_UMI_CST_sep$tag_1.counts[highest_counts]
If counts tag_1 and tag_2 are equal and tag_2 is unmapped, keep tag_1
same_counts <- pA_UMI_CST_sep$tag_1.counts == pA_UMI_CST_sep$tag_2.counts & pA_UMI_CST_sep$tag_2.POI == "unmapped"
pA_UMI_CST_sep$tag_final.POI[same_counts] <- pA_UMI_CST_sep$tag_1.POI[same_counts]
pA_UMI_CST_sep$tag_final.counts[same_counts] <- pA_UMI_CST_sep$tag_1.counts[same_counts]
Ingnore the rest that have the same exact counts: ~6000 out of 550,000
pA_UMI_CST_final <- pA_UMI_CST_sep[! is.na(pA_UMI_CST_sep$tag_final.counts), c("UMI", "cell_BC", "tag_final.POI", "tag_final.counts")]
put it all together and create the final UMI collapsed matrix
pA_UMI_CST_final %>%
dplyr::count(cell_BC,tag_final.POI) %>%
filter(tag_final.POI != "unmapped") %>%
pivot_wider(names_from = tag_final.POI, values_from = n) -> pA_UMI_CST_final_wide
Replace NA with 0
pA_UMI_CST_final_wide[is.na(pA_UMI_CST_final_wide)] = 0
transpose matrix to fit format for Seurat import
pA_UMI_CST_final_wide <- t(pA_UMI_CST_final_wide)
set column names to "cell barcodes"
colnames(pA_UMI_CST_final_wide) <- pA_UMI_CST_final_wide["cell_BC",]
drop the cell_BC row
pA_UMI_CST_final_wide <- pA_UMI_CST_final_wide[-1,]
Hey @Hoohm, I was running into issues when executing the switch_mod branch on a newer dataset and it always terminates with the following error.

Any idea what be causing this?
I'm doing a lot of changes to the main code right now so it might get fixed on the way with the new release. I'm working on it this week, should be able to help you out.
Hi all, After running cite-seq, I'm suspecting that my data have the same issue reported in this thread. However, after running CITE-seq-Count in switch_mod, I got the same error mentioned by @cpflueger2016 back in August 2020. Do you have any suggestions on how to fix this? Thanks for the help.