cellsnp-lite
cellsnp-lite copied to clipboard
Use cellSNP on a big BAM file
Hi! I would be grateful if you can answer my question. I am trying to use the cellSNP line tool on a large bam file (1.2 T) generated by SMARTseq. I aim to count all the possible variants in a specific gene. Do you think I should use the mode 2? Can the analysis be limited to a specific gene? I'm trying this command:
cellsnp-lite -s myfile.bam -b barcodes.txt -O ./results --chrom 3 --minMAF 0.1 --minCOUNT 100 -p 20 -f chr3_genome.fa
But the vcf in output is empty.
here the header of my bam file:
A00560:334:HLKGKDSX7:1:2272:3631:22216 163 3 10001 1 2S55M93S = 10284 326 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACGGCACTGAAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAGGGGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FF,FF:,F:F:,,,:FFFF NH:i:4 HI:i:1 AS:i:96 nM:i:0 BX:Z:TGATCTGCACCGGCACTGAA BC:Z:TGATCTGCACCGGCACTGAA QB:Z:FFFFFFFFFFFFFFFFFFFF QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures IS:Z:Unassigned_NoFeatures UX:Z:GGTGAGGGTT UB:Z:GGTGAGGGTT
A00560:334:HLKGKDSX7:2:2674:26458:12994 163 3 10001 0 26S54M70S = 10389 490 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACTCGTTAACGGTGTAGATCTCGGTGGT :FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFF,:FFFFF:FFFFF:FFFFFFFFFF:FFF:::FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFF, NH:i:40 HI:i:1 AS:i:116 nM:i:1 BX:Z:ACAATGGACACTCGTTAACG BC:Z:ACAATGGACACTCGTTAACG QB:Z:FFFFFFFFFFFFFFFFFFFF QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures IS:Z:Unassigned_NoFeatures UX:Z:GTGAGGGTTA UB:Z:GTGAGGGTTA
A00560:334:HLKGKDSX7:3:1412:5348:2206 163 3 10001 3 8S54M88S = 10284 331 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACTAGTAGTGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FF:FFFFFFF::FFF:FFFFFFFFFFFFF,F,FFFFFFFFFF::F,FFFFFFFFFFF:F:F:FF:FFFF,,, NH:i:2 HI:i:1 AS:i:100 nM:i:0 BX:Z:CGCAGTTGTTCTAGTAGTGA BC:Z:CGCAGTTGTTCTAGTAGTGA QB:Z:FFFFFFFFFFFFFFFFFFFF QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures IS:Z:Unassigned_NoFeatures UX:Z:GTGAGGGTTA UB:Z:GTGAGGGTTA
I really appreciate any help you can provide. Best
Hi, thanks for the question.
Cellsnp-lite does not have an option for genotyping a specific region (gene) now. As your BAM is indeed large, mode 2a could be very slow and you may try mode 2b + 1a instead, i.e., calling first in a bulk manner by mode 2b followed by genotyping in mode 1a.
Alternatively, you may also run mode 1a directly if you can prepare a VCF listing every position of that gene, i.e., specify the CHROM, POS, REF as it is and set the ALT as "." in the VCF file (or set both REF and ALT as ".", and then use -f
option to extract the reference allele from the FASTA file). Cellsnp-lite would assign the allele with the highest UMI/read counts except REF as the ALT.
Generally, the mode 2b + 1a is recommended. Please check the example of each mode in the manual.
Is your merged BAM file from SMART-seq3? As far as I know, SMART-seq2 will produce one BAM file per cell. The empty output could be due to the unmatched cell tag. The defualt cell tag of cellsnp-lite is "CB", which is missing in the three BAM records you shared. You may set proper cell tag with --cellTAG
option, based on your experimental settings.