FLAMES
Demultiplexing
Can this pipeline also demultiplex reads from cell barcodes?
Yes, I have a C++ script for that, but I haven't integrated it into the main pipeline yet, so it is not in the current repository. I will add it ASAP. Thanks for your interest.
Great thank you that would be a big help!
Hi LuyiTian, trying to compile your match_cell_barcode C++ script using the code you suggested
g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp
I get this error
g++: error: edit_dist.cpp: No such file or directory
The edit_dist.cpp file seems to be missing or is it my fault? Thank you in advance for your help 😊
@nofre03 I have uploaded the edit_dist.cpp file.
@callumparr I have uploaded the demultiplexing code. Feel free to ask if you have any problems.
Thank you very much @LuyiTian 😊🤗👋
@LuyiTian thank you for uploading. Will give it a go. Thanks again.
Cheers~ Also feel free to ask questions about downstream analysis. I am currently trying to put some scripts together, but there is no consensus on what should be done for downstream analysis, so I am happy to hear what others want to do with their datasets.
Hi @LuyiTian, I am getting a segmentation fault when using the script and am wondering whether I am running it correctly. Here is what I used and the error:
match_cell_barcode /EX0128/ barcode1.stat barcode01.merged_porechop-trimmed.fastq EX0128_barcodes.tsv 100
set UMI length to 10.
First 5 cell barcode:
AAACCCAAGAAACACT
AAACCCAAGAAACCAT
AAACCCAAGAAACCCA
AAACCCAAGAAACCCG
AAACCCAAGAAACCTG
barcode01.merged_porechop-trimmed.fastq
forward flanking end: 17 401
forward flanking end: 16 336
forward flanking end: 19 318
forward flanking end: 15 306
forward flanking end: 18 293
forward flanking end: 14 160
forward flanking end: 21 147
forward flanking end: 39 140
forward flanking end: 37 140
forward flanking end: 38 137
forward flanking end: 20 133
forward flanking end: 36 131
forward flanking end: 41 120
forward flanking end: 35 119
forward flanking end: 40 116
forward flanking end: 42 110
forward flanking end: 45 88
forward flanking end: 44 88
forward flanking end: 13 87
forward flanking end: 34 87
Segmentation fault (core dumped)
The max edit distance is the maximum edit distance allowed when matching the cell barcode. Since the cell barcode is 16 bp, the distance should be no more than 16, and to get meaningful results we usually set it to no more than 4. So I think 100 is too large; you can set it to 2 or 3.
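To make the threshold concrete, here is a minimal sketch of barcode matching by edit distance. It is not the code in match_cell_barcode or edit_dist.cpp; the function names `edit_distance` and `match_barcode` are hypothetical, and the plain Levenshtein distance (substitutions, insertions and deletions each cost 1) and the tie-handling rule are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Levenshtein distance: substitutions, insertions and deletions each cost 1.
// Assumption for illustration; the real script may score differently.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: index of the single closest whitelist barcode within
// max_dist, or -1 if nothing is close enough or the best distance is tied.
int match_barcode(const std::string& observed,
                  const std::vector<std::string>& whitelist,
                  int max_dist) {
    assert(max_dist >= 0);
    int best = -1;
    int best_dist = max_dist + 1;
    bool tie = false;
    for (std::size_t k = 0; k < whitelist.size(); ++k) {
        int d = edit_distance(observed, whitelist[k]);
        if (d < best_dist) {
            best_dist = d;
            best = static_cast<int>(k);
            tie = false;
        } else if (d == best_dist && d <= max_dist) {
            tie = true;  // two barcodes equally close: ambiguous
        }
    }
    return tie ? -1 : best;
}
```

Any two 16 bp sequences are at most 16 substitutions apart, so a threshold of 16 or more accepts every candidate and the match carries no information. Under this sketch, a small threshold such as 2 keeps only reads whose barcode is clearly closest to one whitelist entry.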
Hi @LuyiTian, I have a question about the "Nanopore sequencing and data preprocessing" description in the Methods section of your FLAMES paper. As far as I understand from this paragraph
For each read, we locate the 'barcode sequence' by searching for the flanking sequence before the 'cell barcode'. The 'cell barcodes' identified from the short-read data provide a reference to search for and trim in the long reads. An edit distance of up to 2 is allowed during cell barcode matching. Reads that failed to match any cell barcode were discarded. Sequences following the cell barcode were used as 'UMIs' and trimmed.
you are considering, for each read, 3 different "barcode" sequences:
- a generic barcode sequence
- the cell barcode
- the UMI

I know about the last two, but I didn't understand what you mean by the first, the one located "before the cell barcode". Could you clarify it, please? Thank you very much for your attention.
@nofre03 The first sequence is not really a barcode. It is part of the primer used by 10x. You can check the 10x read structure here: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html. It also shows that the 10x v3 chemistry has a 12 bp UMI in the primer, not 10 bp.
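As a rough illustration of that read layout (primer flank, then 16 bp cell barcode, then UMI), the sketch below splits a read once the flank has been found. It is a simplified example, not the FLAMES implementation: `SplitRead` and `split_read` are hypothetical names, the flank is passed in by the caller, and it is located with an exact `std::string::find`, whereas a real tool working on noisy long reads would need an error-tolerant search.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for this sketch.
struct SplitRead {
    bool found;           // false if the flank is absent or the read is too short
    std::string barcode;  // bc_len bases right after the flank
    std::string umi;      // umi_len bases right after the barcode
    std::string rest;     // remainder of the read after the UMI
};

// Exact-match search for the flank (assumption: no sequencing errors in it),
// then fixed-length slices for the cell barcode and the UMI.
SplitRead split_read(const std::string& read, const std::string& flank,
                     std::size_t bc_len, std::size_t umi_len) {
    assert(!flank.empty());
    SplitRead out{false, "", "", ""};
    std::size_t pos = read.find(flank);
    if (pos == std::string::npos) return out;
    std::size_t start = pos + flank.size();
    if (start + bc_len + umi_len > read.size()) return out;
    out.found = true;
    out.barcode = read.substr(start, bc_len);
    out.umi = read.substr(start + bc_len, umi_len);
    out.rest = read.substr(start + bc_len + umi_len);
    return out;
}
```

With this layout, calling `split_read(read, flank, 16, 12)` would correspond to the 12 bp UMI mentioned above, while a UMI length of 10 would cut the UMI two bases short and leave those bases at the start of the trimmed read.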
Hi Luyi, Thanks for FLAMES! We've been using the cell barcode matching code for fusion calling in long-read single-cell data (with JAFFAL), and I've noticed that cell barcode assignment seems to be more difficult for some fusions. So I have a few follow-up questions to those posted by others above:
- Do you allow mismatches in the 10x primer sequence (ie is the edit distance for sequences 1. and 2. or just 2. above)?
- Are indels allowed (ie counted towards the edit distance in the same way as SNPs)?
- Do you search for a polyA(T) tail after the UMI?
- Do you search for the reverse complement at the end of the read as well as at the start?
- Is there a maximum length at the start/end of the read where you stop searching?
- For the reads where barcodes can't be assigned: do you think these are mostly fragmented molecules where the barcodes are missing from the sequence, or is it more likely that the barcode has too many errors?
Many thanks. Cheers, Nadia.