Syntax for 3-reads, paired end technology?
Hello, I am processing some DBiT-seq data (though I guess the same applies to any combinatorial barcoding strategy) in which R2 is composed of a spaced cell barcode, like this:
# GAAGCGTTGGCTTCTCGCATCT CAACCACA ATCCACGTGCTTGAGAGGCCAGAGCATTCG ACATTGGC GTGGCCGATGTTTCGCATCGGCGTACGA CTTAGTGGGT ATTTTTTTTTTTTTTTGTTTATGGGGTTTTTTTTGGTTTTTCGAG
# ---------------------- 22222222 ------------------------------ 11111111 ---------------------------- UUUUUUUUUU TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
I have to preprocess these reads to obtain a single cell barcode 1111111122222222 (unless kb supports a syntax for split barcodes…) and produce a FASTQ file with barcode and UMI. So far so good: I can feed kb count with -x 1,0,16:1,16,0:0,0,0 file_R1.fastq.gz file_processed.fastq.gz and everything works.
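For reference, the preprocessing step described here could look roughly like the sketch below. It is only an illustration: the offsets are read off the example R2 layout above (0-based, end-exclusive) and assume every read has exactly that layout, with no indels in the linkers. The function name split_r2 is made up for this example.

```python
# Offsets taken from the example R2 layout (assumption: fixed positions,
# no insertions/deletions in the linker sequences).
BC2 = slice(22, 30)    # segment marked 22222222
BC1 = slice(60, 68)    # segment marked 11111111
UMI = slice(96, 106)   # segment marked UUUUUUUUUU
CDNA_START = 106       # everything after the UMI


def split_r2(seq, qual):
    """Split one R2 record into a BC1+BC2+UMI read and the remaining read.

    Returns ((bc_umi_seq, bc_umi_qual), (rest_seq, rest_qual)).
    """
    bc_umi_seq = seq[BC1] + seq[BC2] + seq[UMI]
    bc_umi_qual = qual[BC1] + qual[BC2] + qual[UMI]
    return (bc_umi_seq, bc_umi_qual), (seq[CDNA_START:], qual[CDNA_START:])


# The example read from the layout above, with a shortened tail.
TAIL = "ATTTTTTTTTTTTTTTGTTTATGG"
example = (
    "GAAGCGTTGGCTTCTCGCATCT" + "CAACCACA"
    + "ATCCACGTGCTTGAGAGGCCAGAGCATTCG" + "ACATTGGC"
    + "GTGGCCGATGTTTCGCATCGGCGTACGA" + "CTTAGTGGGT" + TAIL
)
(bc_umi, bc_umi_q), (rest, rest_q) = split_r2(example, "I" * len(example))
print(bc_umi)  # ACATTGGCCAACCACACTTAGTGGGT
```

The first 16 bases of the output are the merged barcode and the last 10 are the UMI, which matches the 1,0,16 barcode field used in the command above.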
I realized that in many transcripts the portion after the UMI may contain informative mRNA beyond the poly-A, so I started collecting the trimmed R2 as well. I have two options here: either I collate the BC:UMI to R2, or I produce three files (R1, BC, R2). In either case I can't build the proper technology string and I always get an exception. With three files I tried --parity paired -x '1,0,16:1,16,26:0,0,0 2,0,0' R1.fastq.gz BC.fastq.gz R2.fastq.gz and I get
[2024-05-17 08:43:59,864] ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/main.py", line 1618, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=temp_dir)
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/main.py", line 703, in parse_count
    count(
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/count.py", line 1279, in count
    bus_result = kallisto_bus(
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/count.py", line 203, in kallisto_bus
    run_executable(command)
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/dry/__init__.py", line 25, in inner
    return func(*args, **kwargs)
  File "/home/cittaro.davide/miniforge3/envs/sc02/lib/python3.10/site-packages/kb_python/utils.py", line 203, in run_executable
    raise sp.CalledProcessError(p.returncode, ' '.join(command))
In the case of two files (with BC and R2 collated) I used -x '1,0,16:1,16,26:0,0,0 1,26,0' R1.fastq.gz BCR2.fastq.gz, with the same error. I suspect it is raised by the additional (space-separated) specification for R2 in the string. Looking at the technologies in kb --list, it seems I should be able to specify multiple reads for the seq field, but how should I write it?
Addendum: I see that the barcode specification in SURECELL is effectively a split barcode: how can this be passed to the -x option?
There should never be a space in the -x string.
For split barcodes (or split UMI or split biological reads), see the instructions on page 7 of https://www.biorxiv.org/content/10.1101/2023.11.21.568164v2.full.pdf
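As a rough illustration of what a space-free string for this layout might look like, here is a small helper that assembles one. It assumes that several file,start,stop triplets within one field are joined with commas (which is how split barcodes such as SURECELL appear in kb --list); please check this against the linked instructions before relying on it. The helper names are made up for this example, and the offsets are the ones read off the R2 layout in the question.

```python
def triplets(*segments):
    """Join (file, start, stop) segments into one comma-separated field."""
    return ",".join(f"{f},{start},{stop}" for f, start, stop in segments)


def tech_string(barcodes, umis, reads):
    """Assemble 'barcode:UMI:sequence' with no spaces anywhere."""
    return ":".join(triplets(*part) for part in (barcodes, umis, reads))


R1, R2 = 0, 1  # file indices, in the order the FASTQ files are given to kb count

# Split barcode taken directly from the raw R2, so no preprocessing is needed.
x_single = tech_string(
    barcodes=[(R2, 60, 68), (R2, 22, 30)],  # barcode 1, then barcode 2
    umis=[(R2, 96, 106)],
    reads=[(R1, 0, 0)],                     # stop = 0 means "to end of read"
)
print(x_single)  # 1,60,68,1,22,30:1,96,106:0,0,0

# Untested assumption: a second biological segment (R2 after the UMI) is
# appended to the sequence field the same way, for use with --parity paired.
x_paired = tech_string(
    barcodes=[(R2, 60, 68), (R2, 22, 30)],
    umis=[(R2, 96, 106)],
    reads=[(R1, 0, 0), (R2, 106, 0)],
)
print(x_paired)  # 1,60,68,1,22,30:1,96,106:0,0,0,1,106,0
```

Building the string programmatically mainly guards against stray spaces and miscounted fields; the offsets themselves still have to match the actual read structure.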
Thanks, that's clear!