CITE-seq-Count icon indicating copy to clipboard operation
CITE-seq-Count copied to clipboard

What is the difference between read_count and uni_count?

Open cpflueger2016 opened this issue 4 years ago • 26 comments

Hey, first of all thank you so much for the incredibly useful and well documented CITE-seq-count tool. I was going through the documentation and also tried to figure out from the code what the difference is between the folders:

read_count and umi_count

I read that both are corrected for PCR duplicates? Any insight would be very helpful. Thanks heaps!

cpflueger2016 avatar Jan 23 '20 06:01 cpflueger2016

Hello @cpflueger2016

Maybe I should write a bit more details about that part. UMI counts are "demultiplexed" reads. Meaning that you get one count per combination of cell + UMI + TAG regardless of how reads of the same combination you get.

Read counts are not collapsed/demultiplexed. So for 1 UMI you might have 10 reads. Usually we use the umi count matrix for analysis.

I provide both because sometimes you want to have an idea of the complexity of your data or maybe protocol troubleshooting.

Hoohm avatar Jan 23 '20 10:01 Hoohm

UMI counts output

UMI_counts

Read counts output

Read_counts

Hey @Hoohm, thank you so much for your reply and the detailed explanation. I was thinking that this what it meant but really wanted to double-check. In fact, I do see a strange/unexpected difference between UMI_counts and Read_counts. I have attached both tables with the same cell barcodes below. Essentially, when looking at read_counts, I see the enrichment of tags (barcodes) I would expect. However, when looking at the UMI counts, it appears much noisier. Given that the RT should have added the UMI and cell barcode at the same step, I was wondering if you guys have seen such a "late" amplification of tags/barcodes from Antibodies that I would more likely classify as noise. I know that github is for bioinformatics analysis but I thought I ask if you have seen that before ;)

In any case, thanks again for this tremendous tool and that you have included the read_counts for troubleshooting. 👍

cpflueger2016 avatar Jan 26 '20 04:01 cpflueger2016

Hey @cpflueger2016 sorry for the late response.

It is a bit strange that you have some tags in some cells that are more amplified. I'm noticing this on tag_6 for cells 14,15,16 which don't behave the same at all.

Does it affect at all the downstream analysis?

Hoohm avatar Mar 29 '20 12:03 Hoohm

Hey @Hoohm, thanks for the followup.

As you have pointed out, I do expect a much stronger amplification of tags in certain cells. These "barcodes" are driven by a promoter to distinguish between the origin of lentiviral cargo. In my analysis, the count table before UMI collapse resembles other readouts that I have also captured most closely.

For example, row 16, anything other than tag_6 is noise. Similarly, row 11 has the highest counts for tag_1 and tag_3 and the rest is noise.

However, that would be almost impossible to conclude from the UMI counts output, even though the highest counts in that row do match. It's just so strange and I cannot quite explain what is going on. I take you haven't seen something like this either.

cpflueger2016 avatar Mar 30 '20 09:03 cpflueger2016

I guess there is something in the protocol that behaves differently than expected. I haven't seen something like it yet no, sorry.

Would you be able to describe the protocol quickly to maybe have an idea of what's happening?

Hoohm avatar Mar 30 '20 09:03 Hoohm

Sure thing. The tags (barcodes) are expressed in a cell after lentiviral transduction. They have a 5' PCR handle to attach the P7 handle by PCR and on the 3' the tags are polyA tailed. They gain the cell barcode and the UMI in the 10x chromium-based single cell library preparation during the RT (reverse transcription) reaction.

It does look to me that the "wrong" assignment of the tags to the cell barcodes could be due to sequencing error or error tolerance in the kmer analysis. I did run your program with 0 mismatches for UMIs and 0 mismatches for cell barcodes though.

Is there a debug output that I could check which read_id (in the fastq file) the tag assignments to the cell barcode are coming from?

cpflueger2016 avatar Apr 02 '20 08:04 cpflueger2016

Yes there is a --debug flag which will show you each match result and the proposed match. This is not kmer based, it's a simple fuzzy grep so I'm confident this is not the issue.

Having more strict parameters should definitely take are of this kind of errors.

What does fastqc say about the data?

Hoohm avatar Apr 09 '20 06:04 Hoohm

Hey @Hoohm, I did a bit of deeper digging and found that the CITE-seq tool is working perfectly fine. The issue I'm seeing is related to late-stage PCR based barcode swapping.

TL;DR Let me walk you through it maybe someone else will find a similar issue and it can be helpful. In the tables above, I went into line 2 that contains the cell barcode

ATTCTATATCCTGATAGCTA

Tag_4 should be the only one that should be present in that cell as evidenced by the high read_counts prior to UMI correction.

I then went ahead and extracted all the fastq reads that contain that cell barcode

zcat exp1_R1.fastq.gz | grep -B 1 'AAACCCAAGATTGTGA' > cell_barcode_AAACCCAAGATTGTGA.txt

followed by grepping the read identifiers

grep '\@' cell_barcode_AAACCCAAGATTGTGA.txt | perl -pe 's/\s.*//' > read2_accessions_cell_barcode_AAACCCAAGATTGTGA.txt

I then subsetted all the reads that contain tag_5 which is not supposed to be present in that cell.

grep -B 1 'ATTCTATATCCTGATAGCTA' read2_sequence_matched_bc.txt | grep '\@' | perl -pe 's/\s.*//' > tag5_barcode_ids_from_read2.txt

I used those fastq read IDs to go back and extract the cell_barcode + UMI from read1

zcat exp1_R1.fastq.gz | grep -A 1 -f tag5_barcode_ids_from_read2.txt > tag5_read1_cell_barcode_AAACCCAAGATTGTGA.txt

I then checked the unique UMIs that are present for tag_5 to get a sense if that is similar to the output that was reported by CITE-seq tools.

perl -ne '/^[^@-]/ && print $_' tag5_read1_cell_barcode_AAACCCAAGATTGTGA.txt| perl -ne 'my $s = substr $_, -13; print $s' | sort | uniq -c

  1 AAAGCGGGCCTT
  1 ACATTGGACCTT
  3 ACTAGAGTCGGG
  2 AGATCCAGCATC
  2 ATGAGTATCACG
  5 CCACAGGCTCTG
  3 CCATCATTTTTT
  1 CCGTAAAGTTCC
  1 CCTCTTAATTCA
  2 CCTGTTATTGGG
  1 CGGTGACATAAA
  1 CGTATTTCCGCT
  1 CTCCGGGTGGAG
  1 CTTGCCATTTGG
  1 GGAAGAATAGGT
  1 GGCGGCCATCTC
  2 GGGAAATTGGCT
  1 GGTGTTCGACAG
  2 GTCCTGGGGAAT
  2 TAATACAGTGAT
  3 TGCCCAGGGTCA
  1 TGCCTCGAGGAG
  1 TTGGTATATTTA

23 unique entries

The 23 unique entries are quite similar to the 18 unique entries that were the output from CITE-seq tool

image

I picked 1 UMI TTGGTATATTTA and extracted the same cell barcode + UMI combo from read1

grep -B1 'AAACCCAAGATTGTGACCACAGGCTCTG' read1_UMI_CCACAGGCTCTG.txt | grep '\@' | perl -pe 's/\s.*//' > read_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt

followed by extracting the matching read2 information

zcat exp_R2.fastq.gz | grep -A 1 -f read_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt > read2_identicalUMI_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt

Finally, I went back and checked what tags are associated with the same cell barcode + UMI combo...

perl -ne '/^[^@-]/ && print $_' read2_identicalUMI_ID_BC+UMI_AAACCCAAGATTGTGACCACAGGCTCTG.txt | grep -f tags_DNA_sequence.txt | perl -ne 'my $s = substr $_ , 29, 20; print $s ."\n"' | sort | uniq -c

  1 ATCCAAAGTAGGGACTACCT
  5 ATTCTATATCCTGATAGCTA
  1 CATCCAAAGTAGGGACTACA
  3 CATCCAAAGTAGGGACTACC
 80 GCATCCAAAGTAGGGACTAC
  3 TCAAGAGCTCCCAAGAGTTT

And there I have it, all the tags I have used come up with 80 counts for tag_4 which is the actual tag present and the rest get the same UMI and cell barcode in later stages of the PCR as I cannot imagine the same BC+UMI combo to be present on all tags in any given cell.

Now the question is, for any given cell, is there an option to assign the dominant (most counts) UMI+BC combo as the "correct" one and filter all late-stage PCR barcode swaps out? That step would have to be done prior to UMI collapse.

cpflueger2016 avatar Apr 13 '20 02:04 cpflueger2016

That's some nice digging you did there :)

There is no way to do what you need with CITE-seq-Count especially since the dict that collects counts is based on the cell-barcode as the first key, so your UMI/TAG combos will be distributed everywhere.

But maybe I'm misunderstanding this. Would you like to take 15min to chat about this on discord?

Hoohm avatar Apr 18 '20 09:04 Hoohm

Hey @Hoohm, thanks heaps for your reply. Would love to have a chat with you. I just signed up for discord and my number is cpflueger#4139. Feel free to send me an invite. I am in Australia so would be good to chat sometime in the afternoon/evening when it is your morning / mid-day :)

cpflueger2016 avatar Apr 20 '20 01:04 cpflueger2016

Hey @cpflueger2016 had a few minutes to spend on this. the switch_branch now works as we discussed.

Whenever there is more than one most_common, It prints out the top 10 most common and their respective count.

The line looks like: GTTGTAGGTCGGAA,ACTCCTCGCC,tag1-3/tag2-1

Let me know how it goes.

Hoohm avatar Apr 25 '20 09:04 Hoohm

Hey @Hoohm, I have given the switch_mod branch a go and saved the output to a text file. However, I do not see the output on the command line as you have indicated. Did you commit your changes to the switch_modto the repo?

output_matrix.txt

Here is my command that I have run:

nice -n 12 ~/work3/conda_installs/stiletto/miniconda2/envs/cite_seq_switchMOD/bin/CITE-seq-Count \ --cell_barcode_first_base 1 \ --cell_barcode_last_base 16 \ --umi_first_base 17 \ --umi_last_base 28 \ --expected_cells 9000 \ --max-error 4 \ --start-trim 26 \ --sliding-window \ --dense \ --threads 12 \ --no_umi_correction \ --unmapped-tags unmapped.cvs \ --unknown-top-tags 200 \ -R1 RL1953_2019_12_18_pA_BC_R1.fastq.gz \ -R2 RL1953_2019_12_18_pA_BC_R2.fastq.gz \ -t BEP_barcode_tags.csv \ -o pA_barcodes_6xBEP_whitelist_SW_switchMOD_260402020 \ --whitelist whitelist_BEP6x_filtered_cells.txt > output_matrix.txt

cpflueger2016 avatar Apr 29 '20 01:04 cpflueger2016

Have you updated the code? https://github.com/KirstieJane/STEMMRoleModels/wiki/Syncing-your-fork-to-the-original-repository-via-the-browser

Hoohm avatar Apr 29 '20 07:04 Hoohm

Hey @Hoohm,

I successfully updated the branch with your help and ran the CST from the switch_mod branch. But there seems to be an error:

Traceback (most recent call last): File "/home/cpfluger/work2/bin/cite_seq_count/__main__.py", line 543, in <module> main() File "/home/cpfluger/work2/bin/cite_seq_count/__main__.py", line 464, in main print('{},{},{}'.format(cell_barcode, UMI.decode(), '/'.join('-'.join([i[0], str(i[1])]) for i in most_common))) AttributeError: 'str' object has no attribute 'decode'

Any idea what that could be?

cpflueger2016 avatar Apr 30 '20 13:04 cpflueger2016

This is strange. Can you double check that you call it like : CITE-seq-Count -R1 R1.fastq.gz -R2 R2.fastq.gz --tags tags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 10 -T 1 -o results_switch -n 1000000 --mode most_common_tag --no_umi_correction --dense

Hoohm avatar May 01 '20 09:05 Hoohm

Hey @Hoohm,

you're a legend mate! First off, I created a separate conda environment on our server with python3 installed. Then I had to install the branch on our server like that:

pip install git+https://github.com/Hoohm/CITE-seq-Count.git@switch_mod

Others might know that but maybe that is useful info for someone down the line.

Then I ran your pipeline and boy oh boy did it work. Look at just a small screenshot below of how many tags contain the "same" UMI.

tags_with_same_UMI

The question is how difficult would it be to select only the tags with the highest count per UMI that are not unmapped and then output them to a dense matrix.

In any case, this is huge!

cpflueger2016 avatar May 03 '20 08:05 cpflueger2016

Glad to see that it works :) The unmapped question is a real one actually. Should we just select the top 1, discarding unmapped if it is or get the top 1 TAG + the unmapped? I'll deactivate the STDOUT output, this was just for looking at the data, to check the behavior

On Sun, 3 May 2020 at 10:49, cpflueger2016 [email protected] wrote:

Hey @Hoohm https://github.com/Hoohm,

you're a legend mate! First off, I created a separate conda environment on our server with python3 installed. Then I had to install the branch on our server like that:

pip install git+https://github.com/Hoohm/CITE-seq-Count.git@switch_mod

Others might know that but maybe that is useful for info for someone later on.

Then I ran your pipeline and boy oh boy did it work. Look at just a small screenshot below of how many tags contain the "same" UMI.

[image: tags_with_same_UMI] https://user-images.githubusercontent.com/17895758/80909792-920caa80-8d5d-11ea-9bd8-f2be4a421a4e.png

The question is how difficult would it be to select only the tags with the highest count per UMI that are not unmapped and then output them to a dense matrix.

In any case, this is huge!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Hoohm/CITE-seq-Count/issues/105#issuecomment-623076578, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJVO2GELI2B62G6XKSJLZTRPUVYFANCNFSM4KKRFYXQ .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

Hoohm avatar May 03 '20 09:05 Hoohm

If you could keep the STDOUT output as a "debug" option that would be huge since it is really helpful to understand what is going on regarding cellbarcode/UMI swaps.

And yes, option 1: discard unmapped and keep the top 1 TAG.

Grand stuff Patrick!

cpflueger2016 avatar May 03 '20 10:05 cpflueger2016

I'll take a look later this week. I'm also gonna merge your branch with the latest develop to profit from the latest features.

Hoohm avatar May 03 '20 17:05 Hoohm

Hey, were are we on this. Could you run your samples or do you still need something on top of what we did?

Hoohm avatar Jun 28 '20 17:06 Hoohm

Hey @Hoohm, sorry for the late reply. Your implementation has done magic. I'll compile some data and post it here shortly. Essentially, removing the UMI+CB swaps cleans up the data and massively improves the data analysis.

cpflueger2016 avatar Jul 03 '20 11:07 cpflueger2016

That's great! Happy to hear it

Hoohm avatar Jul 05 '20 11:07 Hoohm

Hi @Hoohm, here is the long-promised update on the switch_mod branch.

I have used the switch_mod branch to parse out the UMI/CB swaps. I parsed it with an R script as I'm not too good with python coding. Just a reminder, this is the head of the barcodes for 6 tags BEFORE running it through your switch_mod branch.

pA_barcodes_before_filtering

When I use that data to cluster the tags, this is the output I'm getting.

Clustering_pA_barcodes_before_UMI_swap_removal

As you can see, each cell has almost all tags present which we already established is nonsense and due to the late-stage PCR amplification artefact.

After running the switch_mod branch and filtering with an R script for:

  1. The tag with the highest UMI counts is correct and needs to be kept
  2. If unmapped tags are the highest count, the next highest correct tag will be kept
  3. Same goes for equal counts in rank 1 and rank 2 of a valid tag and unmapped tag -> highest valid tag is kept
  4. If the first two ranks have equal counts of valid tags, it discards that UMI/CB combination

The results look like this:

pA_barcodes_after_UMI_BC_swap_removed

And clustering of that data looks like that

Clustering_pA_barcodes_after_UMI_swap_removal

One can clearly see how much cleaner the data looks like. So thanks heaps for implementing that switch_mod branch. The question is, if that filtering that I'm performing with an R script, can be directly implemented in CITE-seq tools?

Here is the R script that I wrote (pls don't judge too hard - I'm still pretty new to this).

read in the CST output matrix

pA_UMI_CST <- read.csv("switch_mod_STDOUT_UMI_bc.txt", header = FALSE, sep = ",", col.names = c("cell_BC", "UMI", "tag_counts"))

split the different "tag_counts" fields into the respective effectors

pA_UMI_CST %>% separate(tag_counts, c(paste0("tag_",seq(1,8))), sep = "/" ) -> pA_UMI_CST_sep

split the actual counts off of the column tag_1 and tag_2

pA_UMI_CST_sep %>% separate(tag_1, c("tag_1.POI", "tag_1.counts"), "-") -> pA_UMI_CST_sep

pA_UMI_CST_sep$tag_1.counts <- as.numeric(pA_UMI_CST_sep$tag_1.counts)

pA_UMI_CST_sep %>% separate(tag_2, c("tag_2.POI", "tag_2.counts"), "-") -> pA_UMI_CST_sep pA_UMI_CST_sep$tag_2.counts <- as.numeric(pA_UMI_CST_sep$tag_2.counts)

decide on the final tag (based on highest counts) and drop the others

pA_UMI_CST_sep$tag_final.POI <- NA pA_UMI_CST_sep$tag_final.counts <- as.numeric(NA)

Set the POI to the second tag IF the first tag is unmapped

unmapped_first <- pA_UMI_CST_sep$tag_1.POI == "unmapped" pA_UMI_CST_sep$tag_final.POI[unmapped_first] <- pA_UMI_CST_sep$tag_2.ePOI[unmapped_first] pA_UMI_CST_sep$tag_final.counts[unmapped_first] <- pA_UMI_CST_sep$tag_2.counts[unmapped_first]

Set the final protein of interest (POI) to the first tag if it has the highest counts

highest_counts <- pA_UMI_CST_sep$tag_1.counts > pA_UMI_CST_sep$tag_2.counts pA_UMI_CST_sep$tag_final.POI[highest_counts] <- pA_UMI_CST_sep$tag_1.POI[highest_counts] pA_UMI_CST_sep$tag_final.counts[highest_counts] <- pA_UMI_CST_sep$tag_1.counts[highest_counts]

If counts tag_1 and tag_2 are equal and tag_2 is unmapped, keep tag_1

same_counts <- pA_UMI_CST_sep$tag_1.counts == pA_UMI_CST_sep$tag_2.counts & pA_UMI_CST_sep$tag_2.POI == "unmapped" pA_UMI_CST_sep$tag_final.POI[same_counts] <- pA_UMI_CST_sep$tag_1.POI[same_counts] pA_UMI_CST_sep$tag_final.counts[same_counts] <- pA_UMI_CST_sep$tag_1.counts[same_counts]

Ingnore the rest that have the same exact counts: ~6000 out of 550,000

pA_UMI_CST_final <- pA_UMI_CST_sep[! is.na(pA_UMI_CST_sep$tag_final.counts), c("UMI", "cell_BC", "tag_final.POI", "tag_final.counts")]

put it all together and create the final UMI collapsed matrix

pA_UMI_CST_final %>% dplyr::count(cell_BC,tag_final.POI) %>% filter(tag_final.POI != "unmapped") %>% pivot_wider(names_from = tag_final.POI, values_from = n) -> pA_UMI_CST_final_wide

Replace NA with 0

pA_UMI_CST_final_wide[is.na(pA_UMI_CST_final_wide)] = 0

transpose matrix to fit format for Seurat import

pA_UMI_CST_final_wide <- t(pA_UMI_CST_final_wide)

set column names to "cell barcodes"

colnames(pA_UMI_CST_final_wide) <- pA_UMI_CST_final_wide["cell_BC",]

drop the cell_BC row

pA_UMI_CST_final_wide <- pA_UMI_CST_final_wide[-1,]

cpflueger2016 avatar Jul 14 '20 03:07 cpflueger2016

Hey @Hoohm, I was running into issues when executing the switch_mod branch on a newer dataset and it always terminates with the following error.

switch_mod_error

Any idea what be causing this?

cpflueger2016 avatar Aug 16 '20 02:08 cpflueger2016

I'm doing a lot of changes to the main code right now so it might get fixed on the way with the new release. I'm working on it this week, should be able to help you out.

Hoohm avatar Aug 17 '20 16:08 Hoohm

Hi all, After running cite-seq, I'm suspecting that my data have the same issue reported in this thread. However, after running CITE-seq-Count in switch_mod, I got the same error mentioned by @cpflueger2016 back in August 2020. Do you have any suggestions on how to fix this? Thanks for the help.

amagli avatar May 21 '21 20:05 amagli