kallisto icon indicating copy to clipboard operation
kallisto copied to clipboard

Running kallisto(bustools) on amplicon-based (targeted panel) BD Rhapsody scRNA-seq data

Open ahmadalajami opened this issue 2 years ago • 10 comments

Hi,

I am trying to run an amplicon-based (not WTA; targeted panel) BD Rhapsody scRNA-seq data using kallisto/kallistobustools. However, I can see that only BDWTA is supported.

I used the cDNA FASTA file acquired from BD to build the kallisto index using kallisto index command as the one from kallistobustools (namely kb ref) expects a GTF file that I do not have. However, in order to quantify the dataset and return a count matrix, I am expected to provide a T2G file.

My question is, is there a way in kallistobustools to deal with amplicon-based BD data? If I were to use kb ref, what is the easiest way to create a 'dummy' GTF file? Or looking at it differently, is there a way to create a 'dummy' T2G file to use with kb count?

The FASTA ID from BD's cDNA file looks like the following: >ADA|NM_000022.3|Reference_end location: chr20(-): 44619524 site_id: 194282, AMPLICON

Thank you in advance!

ahmadalajami avatar Feb 28 '22 11:02 ahmadalajami

The kb ref command is designed to build a cDNA fasta when provided with a raw genome fasta. As you already have the cDNA fasta, the "kallisto index" command is appropriate to use.

Ideally, we'd make the t2g file optional in bustools (that t2g file is really only useful when you want to do gene-level quantification but each gene could be composed of multiple transcripts), but for now, you can create a dummy t2g file.

The t2g file is a tab-separated values file that is constructed such that the first column is transcript names and the second column is gene names.

In your case, I'd just populate both columns of the t2g file with the FASTA IDs. Let me know if you run any issues.

Be aware that if any of your reads map to multiple amplicons, those reads will be discarded.

Yenaled avatar Feb 28 '22 21:02 Yenaled

Hi @Yenaled,

Thank you very much for your quick reply - was very helpful!

I managed to run it and output the count matrix, however, one thing I do not really understand is what --filter bustools in kb count is doing. So how is filtering done exactly? I could not find relevant information (apologies if I have missed it somewhere).

The reason I am asking is because I also ran Seven Bridges (the software recommend by BD to quantify the abundances of target sequences) on the same dataset. The output returned by Seven Bridges had ~2,500 cells more (out of ~35,000). Most of these cells were present in kb's unfiltered matrix, but none in the filtered one.

ahmadalajami avatar Mar 22 '22 14:03 ahmadalajami

See here:

https://github.com/pachterlab/kb_python/issues/68

Essentially, a second round of bustools correct is being run. Consider this mostly an experimental feature -- I see no reason to do this in almost all cases (I myself have never used it). Just use the unfiltered matrix.

Yenaled avatar Mar 22 '22 19:03 Yenaled

Also, unrelatedly, I know BD Rhapsody WTA comes with a whitelist (it's essentially three different whitelists) and it should probably work the same for your version of BD Rhapsody. See here:

https://teichlab.github.io/scg_lib_structs/data/BD_CLS1.txt https://teichlab.github.io/scg_lib_structs/data/BD_CLS2.txt https://teichlab.github.io/scg_lib_structs/data/BD_CLS3.txt

Consider supplying that to the --whitelist option in kb count (see attached file which has every combination of these three whitelists stitched together). We'll hopefully automatically include this in kb count in a future release but let me know what results you get with this attached file (simply unzip the attached BD_WTA_whitelist_final.txt.gz file and put it in the --whitelist option of kb count).

BD_WTA_whitelist_final.txt.gz

Yenaled avatar Mar 22 '22 19:03 Yenaled

Thank you for the information.

I have obtained the whitelist from BD before and been trying to run kb using it from the beginning.

I used the unfiltered matrix now but the number of cells returned is too high: Seven Bridges: ~37,000 counts_filtered/cells_x_genes.mtx: ~35,000 counts_unfiltered/cells_x_genes.mtx: ~800,000

The commands that I have been using are the following: kallisto index -i index.idx cDNA.fasta kb count -i index.idx -g t2g.txt -x BDWTA -w whitelist.txt --filter bustools -o output fastqs

t2g.txt was created as per the previous discussion.

ahmadalajami avatar Mar 23 '22 13:03 ahmadalajami

The whitelist I attached previously has ~900,000 entries with each barcode being 27 bp long -- is your whitelist.txt the same as or similar to the one I attached?

In any case, I don't know how seven bridges works, but with kb, you need to load the matrix in, say, python, and create a knee plot to filter out the bad cells. See the kallisto bustools tutorial. After that, you should end up with a smaller number of cells. (The --filter bustools is supposed to give similar results to the knee plot, but the knee plot is generally a better way to filter out bad barcodes, e.g. cells that have too few UMIs).

Yenaled avatar Mar 23 '22 13:03 Yenaled

Similar, yes, but not exactly the same number of entries. Each barcode is indeed 27bp long. I am attaching my whitelist.txt for the sake of completeness.

I will filter out the bad cells by creating a knee plot from now on. Thank you!

whitelist.tar.gz

ahmadalajami avatar Mar 23 '22 13:03 ahmadalajami

Great! Thanks, I think your whitelist only has 96x96x96 entries. The BD whitelist that I attached has 97x97x97 entries but everything in your whitelist is in the whitelist I attached (your whitelist is just missing some entries).

Regardless, the primary issue is just making sure you filter out barcodes using a knee plot. :) Feel free to post again if you have any further questions.

Yenaled avatar Mar 23 '22 13:03 Yenaled

Yes, that's exactly what I noticed as well.

It's hard to know how Seven Bridges does filtering as it is not open source. Anyhow, moving forward, I will use a knee plot on the unfiltered matrix obtained from kb to filter out barcodes. Thanks a lot for all your help! :)

ahmadalajami avatar Mar 23 '22 13:03 ahmadalajami

Hi @Yenaled,

I thought you might be interested to see this: I examined the knee plot of kb's unfiltered matrix and accordingly, retained only those with > 1000 total UMI count (call this x). I then found the 'unique' cells of kb's filtered matrix (from --filter bustools) compared to x (left plot) and similarly, those of Seven Bridges compared to x (right plot).

So it seems like there is no additional magic filtering done by Seven Bridges, but rather a matter of setting the cut-threshold.

image

ahmadalajami avatar Mar 30 '22 12:03 ahmadalajami