souporcell
souporcell copied to clipboard
How well does it scale?
Hi,
I wanted to ask how well the tool scales with regards to the number of samples being multiplexed. We are planning an experiment with ≈ 20 samples, where we have bulk WES data as a reference for SNPs and were wondering what is the best way of performing the experiment with regards to demultiplexing. E.g. can the tool handle having all 20 samples in a single run or would it be better to distribute them into multiple lanes (e.g. 2 of 10 samples?).
Many thanks, M
The number of clusters is the most difficult thing to scale on. This is true of any cluster center based clustering algorithm. While it may work with 20 clusters, it would be much safer to do 2 of 10.
Hi thanks for the quick response!
we've done a test run with 3 samples to see how we are doing, when trying to demultiplex them with bulk WES.
I ran the whole thing with:
singularity exec -B /analysis/:/main /resources/software/souporcell_latest.sif souporcell_pipeline.py
--bam /main/cellranger_output/NYZNLW/outs/per_sample_outs/NYZNLW/count/sample_alignments.bam
--barcodes /main/cellranger_output/NYZNLW/outs/per_sample_outs/NYZNLW/count/sample_filtered_feature_bc_matrix/barcodes.tsv.gz
--fasta /main/data/ref_files/refdata-gex-GRCh38-2020-A/fasta/genome.fa
--out_dir /main/souporcell/NYZNLW/snp_demux_NYZNLW
--clusters 3 --threads 20
--known_genotypes /main/data/vcf/final_wes_2mm.vcf
--known_genotypes_sample_names WES_S3 WES_S2 WES_S1
Unfortunately, the tool crashed during the clustering step with:
***** WARNING: File /main/souporcell/NYZNLW/snp_demux_NYZNLW/depth_merged.bed has inconsistent naming convention for record:
KI270728.1 369600 369720
***** WARNING: File /main/souporcell/NYZNLW/snp_demux_NYZNLW/depth_merged.bed has inconsistent naming convention for record:
KI270728.1 369600 369720
Traceback (most recent call last):
File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
souporcell(args, ref_mtx, alt_mtx, final_vcf)
File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
subprocess.check_call(cmd, stdout = log, stderr = err)
File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
We've also tried it running it with a reference vcf file that only has the chromosomal regions, but the crash still happens.
Happy for any insights, Best, M.
Could you share the .err files? Probably the clustering.err file is the most pertinent?
Hi again,
I've done a bit of digging and saw that a similar issue was reported before (https://github.com/wheaton5/souporcell/issues/137).
total loci used 20108
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
While there wasn't a clear solution in the thread, it appears that this issue is caused by the format of the common_variants_covered.vcf
file. Would you have idea on what could be issue here? And what format/columns are required?
Best, M
It requires the genotype (GT). At least that is what is erroring there.
that one must have been using known_genotypes otherwise it would not hit that part of the code. common_variants would not hit that part of the code as it would just use common variants to guide vartrix
and it seems you have the same error. so you could use --common_variants instead of --known_genotypes
Thanks for the reply(s)!
I ran the whole thing with
singularity exec -B /analysis/:/main /resources/software/souporcell_latest.sif souporcell_pipeline.py
--clusters 3 --threads 20
--known_genotypes /main/data/vcf/final_wes_2mm.vcf
--known_genotypes_sample_names WES_S3 WES_S2 WES_S1
where main/data/vcf/final_wes_2mm.vcf
looks like this, so GT should be present.
But I can try running it with --common_variants instead. In that case --common_variants should point to my genotype .vcf file? With or without --skip_remap?
Best, M
With common variants, skip remap is fine
Ok, but in this case how will I be able to map cells to samples in the end?
You would need to write a script comparing genotypes of clusters to genotypes in your vcf. Alternatively you can remove any entries in your known genotype vcf with no GT in the format field and go back to using known gentypes option
I have exactly the same problem when trying to deconvolute 2 donors based on their known genotypes. Interestingly, in one instance this has worked, but in other cases it didn't (same donors, just different samples).
Im curious what variant calling tool is outputting variants without a genotype? Seems like that is their singular purpose?
I am sorry for the confusion. I meant I am getting the same errors by souporcell as described by @Thapeachydude - I used cellsnp-lite to generate the vcf of the donor genotypes. It does provide the GT information.
Presumably not for every variant. I guess I should ignore any variant without a genotype field.
There was no confusion. The line of code it points to tries to access the GT field without checking if it exists. Of course there could be some other issue, but this is problem with 99% certainty.
I just double checked - in the donor genotype vcf all fields have GT; I am able to run souporcell on the samples without the --known_genotypes option just fine
Can you send the .err files? Perhaps it is a problem with the vcf parsing package. Still, the line of code it points to fails an unwrap of an access to the GT field. So im not sure what else it could be.
This is the command line error output:
Traceback (most recent call last):
File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
souporcell(args, ref_mtx, alt_mtx, final_vcf)
File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
subprocess.check_call(cmd, stdout = log, stderr = err)
File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '2', '-a', '100/mixing_100.souporcell.reference/alt.mtx', '-r', '100/mixing_100.souporcell.reference/ref.mtx', '--restarts', '100', '-b', '100/barcodes.tsv', '--min_ref', '10', '--min_alt', '10', '--threads', '12', '--known_genotypes', '100/mixing_100.souporcell.reference/common_variants_covered.vcf', '--known_genotypes_sample_names', 'CLL1', 'CLL3']' returned non-zero exit status 101.
This is the content of clusters.err:
total loci used 5154
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
The other possibility is that the sample name is incorrect
The line of code is let gt = record.call[sample]["GT"][0].to_string();
So either sample or gt is absent. I should add an assert that the sample names are correct
This is the first 50 lines of the donor genotype vcf - sample names I used should be correct (in fact I tried this also without supplying the sample names and souporcell seems to be able to learn them itself but also throws the same error). Is there anything else that could be a problem?
##fileformat=VCFv4.2
##source=cellSNP_v1.2.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=.,Description="Filter info not available">
##INFO=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##INFO=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##INFO=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##FORMAT=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=ALL,Number=5,Type=Integer,Description="total counts for all bases in order of A,C,G,T,N">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CLL1 CLL3
1 14464 . A T . PASS AD=0;DP=29;OTH=1 GT:AD:DP:OTH:PL:ALL 0/0:0:28:1:13,97,1108:28,1,0,0,0 0/0:0:1:0:0,3,42:1,0,0,0,0
1 14599 . T A . PASS AD=45;DP=156;OTH=0 GT:AD:DP:OTH:PL:ALL 1/0:15:124:0:591,374,4443:15,0,0,109,0 1/1:30:32:0:1241,96,84:30,0,0,2,0
1 14604 . A G . PASS AD=45;DP=162;OTH=0 GT:AD:DP:OTH:PL:ALL 1/0:15:131:0:628,395,4669:116,0,15,0,0 1/1:30:31:0:1215,94,30:1,0,30,0,0
1 15820 . G T . PASS AD=2;DP=21;OTH=0 GT:AD:DP:OTH:PL:ALL 0/0:1:17:0:42,51,668:0,0,16,1,0 1/0:1:4:0:30,12,125:0,0,3,1,0
1 629218 . A G . PASS AD=304;DP=310;OTH=1 GT:AD:DP:OTH:PL:ALL 1/1:188:189:1:7516,584,57:1,0,188,1,0 1/1:116:121:0:4708,366,158:5,0,116,0,0
1 629626 . T C . PASS AD=16;DP=20;OTH=0 GT:AD:DP:OTH:PL:ALL 1/1:4:4:0:167,12,0:0,4,0,0,0 1/0:12:16:0:411,49,156:0,12,0,4,0
1 629906 . C T . PASS AD=365;DP=451;OTH=451 GT:AD:DP:OTH:PL:ALL 1/0:285:334:339:15659,5766,6341:45,49,294,285,0 1/0:80:117:112:4442,1910,2871:16,37,96,80,0
1 630026 . C T . PASS AD=599;DP=618;OTH=14 GT:AD:DP:OTH:PL:ALL 1/1:438:452:11:17413,1532,668:10,14,1,438,0 1/1:161:166:3:6568,539,210:3,5,0,161,0
1 630084 . T C . PASS AD=49;DP=50;OTH=0 GT:AD:DP:OTH:PL:ALL 1/1:35:35:0:1300,106,1:0,35,0,0,0 1/1:14:15:0:511,46,16:0,14,0,1,0
1 630110 . T C . PASS AD=48;DP=50;OTH=1 GT:AD:DP:OTH:PL:ALL 1/1:37:37:1:1416,125,15:1,37,0,0,0 1/1:11:13:0:384,40,32:0,11,0,2,0
1 630128 . G A . PASS AD=48;DP=49;OTH=1 GT:AD:DP:OTH:PL:ALL 1/1:37:38:1:1446,128,55:37,1,1,0,0 1/1:11:11:0:433,33,0:11,0,0,0,0
1 630161 . A G . PASS AD=86;DP=87;OTH=5 GT:AD:DP:OTH:PL:ALL 1/1:55:56:3:2120,235,109:1,0,55,3,0 1/1:31:31:2:1156,160,68:0,0,31,2,0
Not sure. At this point id need testing data.
Ok, I'll put the relevant files into a dropbox and share it with you in a bit; I appreciate if you could take a look - I think your tool is very useful and it'd be great to figure this out
I don't know if you saw my email - I shared some data with you.
Yes. I did. Thanks. End of week was busy and then came down with a GI bug which im still dealing with. Hopefully can get to it soon.
no worries - just wanted to make sure you got it; looking forward to any insights! :)
get better soon
Hi, so a brief update from my side.
I think I made it work (or at least not crash) by removing all variants with phased genotypes e.g. 0|1
.
Will try replacing them with unphased calls next, but I'm assuming this will work.
Interesting. It'd be great to keep those variants as simply removing seems like an expensive fix :). Thanks for looking into this!