FLAMES

Demultiplexing

Open callumparr opened this issue 4 years ago • 12 comments

Can this pipeline also demultiplex reads from cell barcodes?

callumparr avatar Jun 23 '20 06:06 callumparr

Yes, I have a C++ script for doing that, but I haven't integrated it into the main pipeline yet, so it is not in the current repository. I will upload it ASAP. Thanks for your interest.

LuyiTian avatar Jun 23 '20 07:06 LuyiTian

Great thank you that would be a big help!

callumparr avatar Jun 23 '20 10:06 callumparr

Hi LuyiTian, I am trying to compile your match_cell_barcode C++ script using the command you suggested:

g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp

but I get this error:

g++: error: edit_dist.cpp: No such file or directory

The edit_dist.cpp file seems to be missing, or is it my mistake? Thank you in advance for your help 😊

nofre03 avatar Aug 19 '20 07:08 nofre03

@nofre03 I have uploaded the edit_dist.cpp file.

@callumparr I have uploaded the demultiplexing code. Feel free to ask if you have any problems.

LuyiTian avatar Aug 21 '20 03:08 LuyiTian

Thank you very much @LuyiTian 😊🤗👋

nofre03 avatar Aug 22 '20 12:08 nofre03

@LuyiTian thank you for uploading. Will give it a go. Thanks again.

callumparr avatar Aug 30 '20 08:08 callumparr

Cheers~ Also feel free to ask questions about downstream analysis. Currently I am trying to put some scripts together, but there is no consensus on what should be done for downstream analysis, so I am happy to hear what others want to do with their datasets.

LuyiTian avatar Aug 31 '20 04:08 LuyiTian

Hi @LuyiTian, I am getting a segmentation fault when using the script and am wondering whether I am running it correctly. Here is the command I used and the output:

match_cell_barcode /EX0128/ barcode1.stat barcode01.merged_porechop-trimmed.fastq EX0128_barcodes.tsv 100

set UMI length to 10.

First 5 cell barcode:

	AAACCCAAGAAACACT

	AAACCCAAGAAACCAT

	AAACCCAAGAAACCCA

	AAACCCAAGAAACCCG

	AAACCCAAGAAACCTG

barcode01.merged_porechop-trimmed.fastq

forward flanking end: 17	401

forward flanking end: 16	336

forward flanking end: 19	318

forward flanking end: 15	306

forward flanking end: 18	293

forward flanking end: 14	160

forward flanking end: 21	147

forward flanking end: 39	140

forward flanking end: 37	140

forward flanking end: 38	137

forward flanking end: 20	133

forward flanking end: 36	131

forward flanking end: 41	120

forward flanking end: 35	119

forward flanking end: 40	116

forward flanking end: 42	110

forward flanking end: 45	88

forward flanking end: 44	88

forward flanking end: 13	87

forward flanking end: 34	87

Segmentation fault (core dumped)

aheravi avatar Oct 14 '20 20:10 aheravi

The max edit distance is the maximum edit distance allowed when matching the cell barcode. Since the cell barcode is 16 bp, the distance should be no more than 16, and to get meaningful results we usually set it to no more than 4. So I think 100 is too large. You can set it to 2 or 3.

LuyiTian avatar Oct 14 '20 23:10 LuyiTian
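
To make the advice above concrete, here is a minimal Python sketch of barcode matching with a bounded edit distance. It is not the match_cell_barcode implementation: the function names, the tie-handling rule, and the guard that rejects a threshold larger than the barcode length are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def match_barcode(observed: str, whitelist: Iterable[str],
                  max_dist: int = 2) -> Optional[str]:
    """Return the single closest whitelist barcode within max_dist, else None.

    Hypothetical guard: a threshold above the barcode length cannot be
    meaningful, so it is rejected up front.
    """
    if max_dist < 0 or max_dist > len(observed):
        raise ValueError("max_dist must be between 0 and the barcode length")
    best, best_dist, tied = None, max_dist + 1, False
    for barcode in whitelist:
        d = edit_distance(observed, barcode)
        if d < best_dist:
            best, best_dist, tied = barcode, d, False
        elif d == best_dist and d <= max_dist:
            tied = True  # two barcodes are equally close: ambiguous read
    return None if tied else best
```

With a 16 bp barcode, calling match_barcode(seq, whitelist, max_dist=2) mirrors the suggested setting, whereas max_dist=100 is rejected by the guard in this sketch.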

Hi @LuyiTian, I have a question about the "Nanopore sequencing and data preprocessing" description in the Methods section of your FLAMES paper. As far as I understand from this paragraph:

For each read, we locate the 'barcode sequence' by searching for the flanking sequence before the 'cell barcode'. The 'cell barcodes' identified from the short-read data provide a reference to search for and trim in the long reads. An edit distance of up to 2 is allowed during cell barcode matching. Reads that failed to match any cell barcode were discarded. Sequences following the cell barcode were used as 'UMIs' and trimmed.

you are considering, for each read, three different "barcode" sequences:

  1. a generic barcode sequence
  2. the cell barcode
  3. the UMI

I know about the last two, but I don't understand what you mean by the first, which is located "before the cell barcode". Could you clarify it, please? Thank you very much for your attention.

nofre03 avatar Oct 20 '20 14:10 nofre03

@nofre03 The first sequence is not really a barcode. It is part of the primer used by 10x. You can check the 10x read structure here: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html. It also shows that the 10x v3 chemistry has a 12 bp UMI in the primer, not 10 bp.

LuyiTian avatar Oct 27 '20 02:10 LuyiTian
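
The read layout described above (primer flank, then 16 bp cell barcode, then UMI) can be sketched in Python as follows. This is only an illustration under stated assumptions: the default flank is the TruSeq Read 1 primer tail as I understand the 10x 3' layout, not necessarily the sequence FLAMES searches for, and exact string search stands in for whatever alignment the real tool uses. The names split_read and ReadParts are hypothetical.

```python
from typing import NamedTuple, Optional


class ReadParts(NamedTuple):
    barcode: str  # candidate cell barcode, still to be matched to the whitelist
    umi: str      # bases immediately after the barcode
    rest: str     # remainder of the read after trimming


def split_read(seq: str,
               flank: str = "CTACACGACGCTCTTCCGATCT",  # assumed primer flank
               barcode_len: int = 16,
               umi_len: int = 12) -> Optional[ReadParts]:
    """Locate the flank exactly and slice out barcode, UMI and the rest.

    umi_len=12 corresponds to 10x v3 chemistry and umi_len=10 to v2,
    as discussed in the thread.
    """
    start = seq.find(flank)
    if start == -1:
        return None  # flank not found: the read cannot be demultiplexed here
    bc_start = start + len(flank)
    umi_start = bc_start + barcode_len
    umi_end = umi_start + umi_len
    if umi_end > len(seq):
        return None  # read is too short to contain barcode plus UMI
    return ReadParts(seq[bc_start:umi_start],
                     seq[umi_start:umi_end],
                     seq[umi_end:])
```

Using the wrong umi_len (10 on v3 data) would leave two UMI bases at the start of the trimmed read, which is why the chemistry version matters.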

Hi Luyi, Thanks for FLAMES! We've been using the cell barcode matching code for fusion calling in long-read single-cell data (with JAFFAL), and I've noticed that cell barcode assignment seems to be more difficult for some fusions. So I have a few follow-up questions to those posted by others above:

  • Do you allow mismatches in the 10x primer sequence (i.e. does the edit distance apply to sequences 1. and 2., or just 2. above)?
  • Are indels allowed (i.e. counted towards the edit distance in the same way as SNPs)?
  • Do you search for a polyA(T) tail after the UMI?
  • Do you search for the reverse complement at the end of the read as well as at the start?
  • Is there a maximum length at the start/end of the read where you stop searching?
  • For the reads where barcodes can't be assigned: do you think these are mostly fragmented molecules where the barcodes are missing from the sequence, or are they more likely to be cases of the barcode having too many errors?

Many thanks. Cheers, Nadia.

nadiadavidson avatar Jun 11 '21 08:06 nadiadavidson