kallisto
Barcode length is changed in 10xv1
I tried to process the pbmc3k dataset from 10x v1 chemistry with kallisto bus (version 0.46.0) and ran into some problems. Since the dataset is in a different format from that of Cell Ranger 1.1+, I downloaded the BAM file and converted it into FASTQ files with bamtofastq. I got R1, which is the 98 nt read; R2, which has read length 14, so I think it's the barcode; and R3, which has read length 10, so I think it's the UMI. Here is the code used to process the FASTQ files:
cd ./data/pbmc3k_fastqs
kallisto bus -i ../../output/hs_tr_index.idx -o . -x 10xv1 -t8 \
bamtofastq_S1_L000_R1_001.fastq bamtofastq_S1_L000_R3_001.fastq \
bamtofastq_S1_L000_R2_001.fastq \
bamtofastq_S1_L000_R1_002.fastq bamtofastq_S1_L000_R3_002.fastq \
bamtofastq_S1_L000_R2_002.fastq \
bamtofastq_S1_L000_R1_003.fastq bamtofastq_S1_L000_R3_003.fastq \
bamtofastq_S1_L000_R2_003.fastq \
bamtofastq_S1_L000_R1_004.fastq bamtofastq_S1_L000_R3_004.fastq \
bamtofastq_S1_L000_R2_004.fastq
I think there is a v1 whitelist that comes with Cell Ranger, called 737K-april-2014_rc.txt (correct me if I'm wrong). Then I tried to do barcode error correction with bustools:
bustools correct -w ../../data/whitelist_v1.txt -p output.bus | \
bustools sort -o output.correct.sort.bus -t4 -
And got this error:
Error: barcode length and whitelist length differ, barcodes = 12, whitelist = 14
check that your whitelist matches the technology used
However, I checked the FASTQ file, and the barcode length is 14. I wonder how it became 12. For another 10xv1 dataset, I skipped error correction (because of this error) and went all the way to the gene count matrices. Again, I checked in the FASTQ file that the barcode length is 14, and used awk to check the average length, which also returned 14. Yet the barcode became 16 nt in one matrix and 12 nt in another. When the barcode becomes 12 nt long, duplicate barcodes appear in different columns of the matrix; perhaps the truncation happens after UMIs are collapsed, since otherwise there would be no duplicates. When the barcode becomes 16 nt long, it seems that 2 A's are added to the beginning of each barcode and there are no duplicates. What's more worrisome is that different runs on the same data can result in different barcode lengths.
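An average read length can hide a minority of shorter or longer reads, so a length distribution is a slightly stronger check than the awk average mentioned above. Here is a minimal sketch, assuming an uncompressed FASTQ with four lines per record; the function name read_length_counts and the max_reads cap are made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_length_counts(fastq_lines, max_reads=100_000):
    """Count how often each read length occurs in the first max_reads records.

    fastq_lines is any iterable of lines, e.g. an open file handle.
    """
    counts = Counter()
    # A FASTQ record has 4 lines; the sequence is the 2nd line of each record.
    for seq in islice(fastq_lines, 1, 4 * max_reads, 4):
        counts[len(seq.rstrip("\r\n"))] += 1
    return counts


# Usage (file name taken from the command above):
# with open("bamtofastq_S1_L000_R2_001.fastq") as handle:
#     print(read_length_counts(handle))
```

If every barcode read really is 14 nt, the result should contain a single key, 14.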
A way to work around this: replace 10xv1 with 2,0,14:1,0,10:0,0,0. The string consists of three x,y,z triplets separated by colons. x is the index of the file, in the order the files are listed in the command (0-based, so the first file is 0 and the second is 1). y is the position at which the feature of interest starts, again 0-based, so if the barcode or UMI starts at the first base, y should be 0. z is the position at which the feature of interest ends; for instance, if the barcode starts at 0 and is 16 nt long, z should be 16. If both y and z are 0, all bases in the read are used. The first x,y,z triplet is for the barcode, the second is for the UMI, and the third is for the biological sequence. However, the problem with using 10xv1 directly should still be fixed.
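To sanity-check a custom -x string before running kallisto bus, the triplets can be parsed and the implied feature lengths printed. This is only an illustrative sketch of the format as described in this thread, not kallisto's own parser; Feature and parse_technology are hypothetical names, and a stop of 0 is treated as "to the end of the read", as the kallisto documentation quoted later in this thread describes.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    file_index: int  # 0-based index of the FASTQ file on the command line
    start: int       # 0-based, included
    stop: int        # 0-based, excluded; 0 means "to the end of the read"

    @property
    def length(self) -> Optional[int]:
        # With stop == 0 the length depends on the read, so it is not fixed.
        return None if self.stop == 0 else self.stop - self.start


def parse_technology(spec: str) -> dict:
    """Split 'bc:umi:seq' into three named (file, start, stop) triplets."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated triplets")
    names = ("barcode", "umi", "sequence")
    return {name: Feature(*map(int, part.split(","))) for name, part in zip(names, parts)}


tech = parse_technology("2,0,14:1,0,10:0,0,0")
for name, feature in tech.items():
    print(name, feature.file_index, feature.length)
```

For the workaround string above, this reports a 14 nt barcode in file 2, a 10 nt UMI in file 1, and the whole read of file 0 as the biological sequence.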
Regarding the -x flag...
I am using MARSseq chemistry, which is:
- R1: sequence read (length 66)
- R2: plate_barcode (4 bp)-cell_barcode (7 bp)-UMI (8 bp)
My -x flag was as follows: 1,0,11:1,12,19:0,0,0
but I am getting the following error message:
Error: barcode length and whitelist length differ, barcodes = 11, whitelist = 12
If I understand correctly, barcodes = 11 is the length of the barcodes in the output.bus file, and whitelist = 12 is the length of the barcodes in my barcode list.
I exported the whitelist from the bus file using bustools whitelist -o ./bus_whitelist.txt output.bus and checked the length of the barcodes using $ wc -L bus_whitelist.txt, which returned 11 bus_whitelist.txt
I did the same for my barcode list using $ wc -L ../data/human5_plate_mars_merge_barcodes.txt, which returned 11 ../data/human5_plate_mars_merge_barcodes.txt
Both my barcode list and the whitelist from the bus file have the same length, so why am I getting Error: barcode length and whitelist length differ, barcodes = 11, whitelist = 12?
Thanks, HM
Use 1,0,12 for barcode
Dear @lambdamoses, any suggestion to solve the above issue? I think this is the source of other bugs that I have in the downstream analysis. Best, HM
This is what the kallisto documentation says:
Additionally kallisto bus will accept a string specifying a new technology in the format of bc:umi:seq where each of bc,umi and seq are a triplet of integers separated by a comma, denoting the file index, start and stop of the sequence used. For example to specify the 10xV2 technology we would use 0,0,16:0,16,26:1,0,0. The first part bc is 0,0,16 indicating it is in the 0-th file (also known as the first file in plain english), the barcode starts at the 0-th bp and ends at the 16-th bp in the sequence (i.e. 16bp barcode), the UMI is similarly in the same file, right after the barcode in position 16-26 (a 10bp UMI), finally the sequence is in a separate file, starts at 0 and ends at 0 (in this case stopping at 0 means there is no limit, we use the entire sequence).
So the 0,0,16 includes position 0 and excludes position 16 for the 16bp barcode for 10xv2. So in your case, you should use 1,0,12 rather than 1,0,11 for barcode, since the latter will exclude the last base of the barcode.
Thanks for your reply,
I just tried your suggestion and used 1,0,12:1,12,19:0,0,0 and got the following output after running the bustools text command:
GCTTGCTACTCC TAAGGCA 1547760 1
GCTTAGTCTCCG GGCGGTC 809667 1
GGTAGGAGACTG AAATACG 378910 1
GGGGTCCGCATT GACGTGC 10330086 1
GTAGTGATGACG GGTTGAC 29816 1
GATCGGTCTTAA ATAGACG 378910 1
I assume the first column is the barcode, which is 12 bp, but my whitelist is 11 bp! One more thing: I assume the second column is the UMI sequence (correct me if I am wrong), and it is 7 bp instead of 8 bp!
Any suggestion of how we can fix this? Much appreciated, HM
Yes, the first column is barcode, and the second is UMI. For UMI, 1,12,19 means the base at position 12 (13th base in the actual sequence) is included, and the base at position 19 is excluded. That gives you the bases at positions 12, 13, 14, 15, 16, 17, 18, which is 7 bases. I thought you wanted 12bp barcodes since you said the whitelist is 12bp. If you want 11bp barcode that goes from position 0 to 10, then 1,0,11 is right. Here position 11 is excluded. If position 11 is supposed to be part of the UMI, then for UMI, use 1,11,19, which will give you 8 bases.
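The start-included, stop-excluded convention described here is the same as Python slicing, so the two -x strings from this exchange can be compared on a made-up read. The 19 nt read below is hypothetical and only follows the layout given earlier (4 nt plate barcode, 7 nt cell barcode, 8 nt UMI); extract is an illustrative helper, not part of kallisto.

```python
# Hypothetical MARS-seq R2 read: plate barcode + cell barcode + UMI.
plate, cell, umi = "ACGT", "TTTTTTT", "GGGGGGGG"
r2 = plate + cell + umi  # 19 nt in total


def extract(read, start, stop):
    """Return the bases from start (included) up to stop (excluded)."""
    return read[start:stop]


# 1,0,11:1,11,19 recovers the 11 nt barcode and the full 8 nt UMI.
assert extract(r2, 0, 11) == plate + cell
assert extract(r2, 11, 19) == umi

# 1,0,12:1,12,19 moves the first UMI base into the barcode.
print(len(extract(r2, 0, 12)), len(extract(r2, 12, 19)))  # prints "12 7"
```

The second case reproduces the 12 bp barcode and 7 bp UMI seen in the bustools text output above.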
Great, thanks for your help. I am getting exactly the length that I am looking for.
Now the next commands in the pipeline (bustools correct and bustools sort) are giving the following error:
Found 4224 barcodes in the whitelist
Number of hamming dist 1 barcodes = 100114
Error: barcode length and whitelist length differ, barcodes = 11, whitelist = 12
check that your whitelist matches the technology used
Read in 0 BUS records
Segmentation fault
Why am I getting this error? barcodes = 11 is the barcode length extracted from the bus file, and whitelist = 12 is the barcode length extracted from the list that I provided.
The list was exported from a Matlab script to a txt file. Do you think I have to export the concatenated barcodes (plate + cell barcodes) from another program because of hidden characters? I inspected the barcode text file in a text editor but didn't find any extra characters. Any suggestion? Or might there be a bug in the bustools correct code, such that it counts non-ATGC characters?
Thanks, HM
The barcodes are supposed to be 11 bp, right? Are the barcodes in the whitelist 11 bp when you inspect it? One hypothesis, and I don't know if it's right, but this might be relevant to the "hidden character": Windows has a different line ending from Unix-like systems. The Windows line ending is \r\n, while the Unix one is \n. If you created the whitelist on Windows and used bustools on Linux, then the line ending might be causing problems.
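The line-ending hypothesis can be tested, and the file repaired, without relying on a text editor, which usually hides \r. Below is a minimal sketch, assuming a plain ASCII whitelist with one barcode per line; normalize_whitelist and rewrite_whitelist are illustrative names. That bustools counts a trailing \r as part of the barcode is only an inference from the error message (11 + 1 = 12), not something checked against its source.

```python
from pathlib import Path


def normalize_whitelist(text):
    """Return (barcodes, number of carriage returns) from raw whitelist text."""
    n_cr = text.count("\r")
    # splitlines() handles \n, \r\n and \r; strip() drops stray whitespace.
    barcodes = [line.strip() for line in text.splitlines() if line.strip()]
    return barcodes, n_cr


def rewrite_whitelist(src, dst):
    """Write a copy of src with Unix line endings; return the count of \\r removed."""
    # Bytes I/O avoids any automatic newline translation by Python.
    text = Path(src).read_bytes().decode("ascii")
    barcodes, n_cr = normalize_whitelist(text)
    Path(dst).write_bytes(("\n".join(barcodes) + "\n").encode("ascii"))
    return n_cr


barcodes, n_cr = normalize_whitelist("ACGTTTTTTTT\r\nGGGGCCCCAAA\r\n")
print(n_cr, sorted({len(bc) for bc in barcodes}))  # prints "2 [11]"
```

A non-zero carriage-return count on the original whitelist would support the hypothesis; the rewritten file can then be passed to bustools correct -w.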
You are right, the pipeline runs smoothly after I removed the \r. Thank you. It might be worth making bustools correct count only the ATGC characters. zUMIs did something similar to avoid such a situation.
I am running into another issue with the velocity.R ...
> cc_tsne <- show.velocity.on.embedding.cor(emb = Embeddings(seu, "tsne"),
+ vel = Tool(seu, slot = "RunVelocity"),
+ n.cores = 50, show.grid.flow = TRUE,
+ grid.n = 50, cell.colors = cell_colors,
+ cex = 0.5, cell.border.alpha = 0,
+ arrow.scale = 2, arrow.lwd = 0.75,
+ xlab = "tsne1", ylab = "tsne2")
delta projections ... log knn ... transition probs ... done
calculating arrows ... done
grid estimates ... Error in seq.default(rx[1], rx[2], length.out = grid.n) :
'from' must be a finite number
Did you ever encounter such a thing? Do you have a suggestion as to why I am getting this error message? I didn't find a clear solution for this error message on GitHub!
Thanks, HM
There's a GitHub issue about this and it's more appropriate to discuss it there: https://github.com/velocyto-team/velocyto.R/issues/23 I have never encountered this error myself, but debug(show.velocity.on.embedding.cor) might help you find what caused this error. It might be something about the dataset.
Thanks a lot, I will keep you updated.