souporcell icon indicating copy to clipboard operation
souporcell copied to clipboard

How well does it scale?

Open Thapeachydude opened this issue 1 year ago • 32 comments

Hi,

I wanted to ask how well the tool scales with regards to the number of samples being multiplexed. We are planning an experiment with ≈ 20 samples, where we have bulk WES data as a reference for SNPs and were wondering what is the best way of performing the experiment with regards to demultiplexing. E.g. can the tool handle having all 20 samples in a single run or would it be better to distribute them into multiple lanes (e.g. 2 of 10 samples?).

Many thanks, M

Thapeachydude avatar Mar 03 '23 09:03 Thapeachydude

The number of clusters is the most difficult thing to scale on. This is true of any cluster center based clustering algorithm. While it may work with 20 clusters, it would be much safer to do 2 of 10.

wheaton5 avatar Mar 03 '23 14:03 wheaton5

Hi thanks for the quick response!

we've done a test run with 3 samples to see how we are doing, when trying to demultiplex them with bulk WES.

I ran the whole thing with:

singularity exec -B /analysis/:/main /resources/software/souporcell_latest.sif souporcell_pipeline.py 
--bam /main/cellranger_output/NYZNLW/outs/per_sample_outs/NYZNLW/count/sample_alignments.bam 
--barcodes /main/cellranger_output/NYZNLW/outs/per_sample_outs/NYZNLW/count/sample_filtered_feature_bc_matrix/barcodes.tsv.gz 
--fasta /main/data/ref_files/refdata-gex-GRCh38-2020-A/fasta/genome.fa 
--out_dir /main/souporcell/NYZNLW/snp_demux_NYZNLW 
--clusters 3 --threads 20 
--known_genotypes /main/data/vcf/final_wes_2mm.vcf 
--known_genotypes_sample_names WES_S3 WES_S2 WES_S1

Unfortunately, the tool crashed during the clustering step with:

***** WARNING: File /main/souporcell/NYZNLW/snp_demux_NYZNLW/depth_merged.bed has inconsistent naming convention for record:
KI270728.1	369600	369720

***** WARNING: File /main/souporcell/NYZNLW/snp_demux_NYZNLW/depth_merged.bed has inconsistent naming convention for record:
KI270728.1	369600	369720

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err)
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)

We've also tried it running it with a reference vcf file that only has the chromosomal regions, but the crash still happens.

Happy for any insights, Best, M.

Thapeachydude avatar Mar 25 '23 09:03 Thapeachydude

Could you share the .err files? Probably the clustering.err file is the most pertinent?

wheaton5 avatar Mar 25 '23 09:03 wheaton5

Thanks for the quick reply!

Sure, cluster, retag and minimap error files attached. errors.zip

Thapeachydude avatar Mar 25 '23 20:03 Thapeachydude

Hi again,

I've done a bit of digging and saw that a similar issue was reported before (https://github.com/wheaton5/souporcell/issues/137).

total loci used 20108
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30

While there wasn't a clear solution in the thread, it appears that this issue is caused by the format of the common_variants_covered.vcf file. Would you have idea on what could be issue here? And what format/columns are required?

Best, M

Thapeachydude avatar Mar 28 '23 14:03 Thapeachydude

It requires the genotype (GT). At least that is what is erroring there.

wheaton5 avatar Mar 28 '23 15:03 wheaton5

that one must have been using known_genotypes otherwise it would not hit that part of the code. common_variants would not hit that part of the code as it would just use common variants to guide vartrix

wheaton5 avatar Mar 28 '23 15:03 wheaton5

and it seems you have the same error. so you could use --common_variants instead of --known_genotypes

wheaton5 avatar Mar 28 '23 15:03 wheaton5

Thanks for the reply(s)!

I ran the whole thing with

singularity exec -B /analysis/:/main /resources/software/souporcell_latest.sif souporcell_pipeline.py 
--clusters 3 --threads 20 
--known_genotypes /main/data/vcf/final_wes_2mm.vcf 
--known_genotypes_sample_names WES_S3 WES_S2 WES_S1

where main/data/vcf/final_wes_2mm.vcf looks like this, so GT should be present. image

But I can try running it with --common_variants instead. In that case --common_variants should point to my genotype .vcf file? With or without --skip_remap?

Best, M

Thapeachydude avatar Mar 28 '23 15:03 Thapeachydude

With common variants, skip remap is fine

wheaton5 avatar Mar 28 '23 15:03 wheaton5

Ok, but in this case how will I be able to map cells to samples in the end?

Thapeachydude avatar Mar 28 '23 16:03 Thapeachydude

You would need to write a script comparing genotypes of clusters to genotypes in your vcf. Alternatively you can remove any entries in your known genotype vcf with no GT in the format field and go back to using known gentypes option

wheaton5 avatar Mar 28 '23 16:03 wheaton5

I have exactly the same problem when trying to deconvolute 2 donors based on their known genotypes. Interestingly, in one instance this has worked, but in other cases it didn't (same donors, just different samples).

liviuspenter avatar Mar 30 '23 12:03 liviuspenter

Im curious what variant calling tool is outputting variants without a genotype? Seems like that is their singular purpose?

wheaton5 avatar Mar 30 '23 14:03 wheaton5

I am sorry for the confusion. I meant I am getting the same errors by souporcell as described by @Thapeachydude - I used cellsnp-lite to generate the vcf of the donor genotypes. It does provide the GT information.

liviuspenter avatar Mar 30 '23 14:03 liviuspenter

Presumably not for every variant. I guess I should ignore any variant without a genotype field.

wheaton5 avatar Mar 30 '23 14:03 wheaton5

There was no confusion. The line of code it points to tries to access the GT field without checking if it exists. Of course there could be some other issue, but this is problem with 99% certainty.

wheaton5 avatar Mar 30 '23 14:03 wheaton5

I just double checked - in the donor genotype vcf all fields have GT; I am able to run souporcell on the samples without the --known_genotypes option just fine

liviuspenter avatar Mar 30 '23 14:03 liviuspenter

Can you send the .err files? Perhaps it is a problem with the vcf parsing package. Still, the line of code it points to fails an unwrap of an access to the GT field. So im not sure what else it could be.

wheaton5 avatar Mar 30 '23 14:03 wheaton5

This is the command line error output:

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 593, in <module>
    souporcell(args, ref_mtx, alt_mtx, final_vcf)
  File "/opt/souporcell/souporcell_pipeline.py", line 531, in souporcell
    subprocess.check_call(cmd, stdout = log, stderr = err) 
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/souporcell/target/release/souporcell', '-k', '2', '-a', '100/mixing_100.souporcell.reference/alt.mtx', '-r', '100/mixing_100.souporcell.reference/ref.mtx', '--restarts', '100', '-b', '100/barcodes.tsv', '--min_ref', '10', '--min_alt', '10', '--threads', '12', '--known_genotypes', '100/mixing_100.souporcell.reference/common_variants_covered.vcf', '--known_genotypes_sample_names', 'CLL1', 'CLL3']' returned non-zero exit status 101.

This is the content of clusters.err:

total loci used 5154
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30
thread '<unnamed>' panicked at 'no entry found for key', src/main.rs:325:30

liviuspenter avatar Mar 30 '23 15:03 liviuspenter

The other possibility is that the sample name is incorrect

wheaton5 avatar Mar 30 '23 15:03 wheaton5

The line of code is let gt = record.call[sample]["GT"][0].to_string();

So either sample or gt is absent. I should add an assert that the sample names are correct

wheaton5 avatar Mar 30 '23 15:03 wheaton5

This is the first 50 lines of the donor genotype vcf - sample names I used should be correct (in fact I tried this also without supplying the sample names and souporcell seems to be able to learn them itself but also throws the same error). Is there anything else that could be a problem?

##fileformat=VCFv4.2
##source=cellSNP_v1.2.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=.,Description="Filter info not available">
##INFO=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##INFO=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##INFO=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="total counts for ALT and REF">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="total counts for ALT">
##FORMAT=<ID=OTH,Number=1,Type=Integer,Description="total counts for other bases from REF and ALT">
##FORMAT=<ID=ALL,Number=5,Type=Integer,Description="total counts for all bases in order of A,C,G,T,N">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	CLL1	CLL3
1	14464	.	A	T	.	PASS	AD=0;DP=29;OTH=1	GT:AD:DP:OTH:PL:ALL	0/0:0:28:1:13,97,1108:28,1,0,0,0	0/0:0:1:0:0,3,42:1,0,0,0,0
1	14599	.	T	A	.	PASS	AD=45;DP=156;OTH=0	GT:AD:DP:OTH:PL:ALL	1/0:15:124:0:591,374,4443:15,0,0,109,0	1/1:30:32:0:1241,96,84:30,0,0,2,0
1	14604	.	A	G	.	PASS	AD=45;DP=162;OTH=0	GT:AD:DP:OTH:PL:ALL	1/0:15:131:0:628,395,4669:116,0,15,0,0	1/1:30:31:0:1215,94,30:1,0,30,0,0
1	15820	.	G	T	.	PASS	AD=2;DP=21;OTH=0	GT:AD:DP:OTH:PL:ALL	0/0:1:17:0:42,51,668:0,0,16,1,0	1/0:1:4:0:30,12,125:0,0,3,1,0
1	629218	.	A	G	.	PASS	AD=304;DP=310;OTH=1	GT:AD:DP:OTH:PL:ALL	1/1:188:189:1:7516,584,57:1,0,188,1,0	1/1:116:121:0:4708,366,158:5,0,116,0,0
1	629626	.	T	C	.	PASS	AD=16;DP=20;OTH=0	GT:AD:DP:OTH:PL:ALL	1/1:4:4:0:167,12,0:0,4,0,0,0	1/0:12:16:0:411,49,156:0,12,0,4,0
1	629906	.	C	T	.	PASS	AD=365;DP=451;OTH=451	GT:AD:DP:OTH:PL:ALL	1/0:285:334:339:15659,5766,6341:45,49,294,285,0	1/0:80:117:112:4442,1910,2871:16,37,96,80,0
1	630026	.	C	T	.	PASS	AD=599;DP=618;OTH=14	GT:AD:DP:OTH:PL:ALL	1/1:438:452:11:17413,1532,668:10,14,1,438,0	1/1:161:166:3:6568,539,210:3,5,0,161,0
1	630084	.	T	C	.	PASS	AD=49;DP=50;OTH=0	GT:AD:DP:OTH:PL:ALL	1/1:35:35:0:1300,106,1:0,35,0,0,0	1/1:14:15:0:511,46,16:0,14,0,1,0
1	630110	.	T	C	.	PASS	AD=48;DP=50;OTH=1	GT:AD:DP:OTH:PL:ALL	1/1:37:37:1:1416,125,15:1,37,0,0,0	1/1:11:13:0:384,40,32:0,11,0,2,0
1	630128	.	G	A	.	PASS	AD=48;DP=49;OTH=1	GT:AD:DP:OTH:PL:ALL	1/1:37:38:1:1446,128,55:37,1,1,0,0	1/1:11:11:0:433,33,0:11,0,0,0,0
1	630161	.	A	G	.	PASS	AD=86;DP=87;OTH=5	GT:AD:DP:OTH:PL:ALL	1/1:55:56:3:2120,235,109:1,0,55,3,0	1/1:31:31:2:1156,160,68:0,0,31,2,0

liviuspenter avatar Mar 30 '23 15:03 liviuspenter

Not sure. At this point id need testing data.

wheaton5 avatar Mar 30 '23 15:03 wheaton5

Ok, I'll put the relevant files into a dropbox and share it with you in a bit; I appreciate if you could take a look - I think your tool is very useful and it'd be great to figure this out

liviuspenter avatar Mar 30 '23 15:03 liviuspenter

I don't know if you saw my email - I shared some data with you.

liviuspenter avatar Apr 02 '23 16:04 liviuspenter

Yes. I did. Thanks. End of week was busy and then came down with a GI bug which im still dealing with. Hopefully can get to it soon.

wheaton5 avatar Apr 02 '23 16:04 wheaton5

no worries - just wanted to make sure you got it; looking forward to any insights! :)

get better soon

liviuspenter avatar Apr 02 '23 16:04 liviuspenter

Hi, so a brief update from my side. I think I made it work (or at least not crash) by removing all variants with phased genotypes e.g. 0|1. Will try replacing them with unphased calls next, but I'm assuming this will work.

Thapeachydude avatar Apr 03 '23 08:04 Thapeachydude

Interesting. It'd be great to keep those variants as simply removing seems like an expensive fix :). Thanks for looking into this!

liviuspenter avatar Apr 03 '23 10:04 liviuspenter