scDRS Identical assoc_mcp in results

trafficstars

Hi @martinjzhang ,

I have used scDRS (great tool, thank you!) to score a single cell dataset against a list of genes related to a disease (obtained from GWAS sum stats). I am a bit puzzled with the result for the cell type association:

`group	n_cell	n_ctrl	assoc_mcp	assoc_mcz	hetero_mcp	hetero_mcz	n_fdr_0.05	n_fdr_0.1	n_fdr_0.2
A	8938	200	0.0049751243	9.102756	0.0049751243	5.8678365	2800	3027	3296
B	1432	200	0.0049751243	4.5017133	0.0099502485	3.6684802	278	312	354
C	214	200	0.0049751243	11.462115	0.0049751243	7.4737678	139	151	159
D	142	200	0.0049751243	27.9078	0.24875621	0.7671775	142	142	142
E	373	200	0.0049751243	26.924177	0.0049751243	4.5101657	373	373	373
F	316	200	0.0049751243	29.63548	0.0049751243	3.9916444	316	316	316
G	169	200	0.0049751243	30.313704	0.024875622	2.6987221	169	169	169
H	228	200	0.0049751243	28.936302	0.029850746	2.2715647	228	228	228
I	324	200	0.0049751243	30.983482	0.0049751243	4.5710645	322	322	322
J	301	200	0.0049751243	29.881008	0.049751244	2.0174477	300	300	300
K	232	200	0.0049751243	25.27538	0.07960199	1.5823182	232	232	232
L	372	200	0.0049751243	25.981464	0.0099502485	2.542919	369	369	369
M	282	200	0.0049751243	30.938341	0.029850746	2.1587946	282	282	282
N	173	200	0.0049751243	34.117924	0.06965174	1.4641023	173	173	173
O	88	200	0.0049751243	36.643044	0.024875622	1.9150562	88	88	88
P	259	200	0.0049751243	41.091362	0.014925373	2.2539406	259	259	259
Q	273	200	0.0049751243	34.183285	0.0099502485	2.7920082	273	273	273
R	263	200	0.0049751243	38.294342	0.0099502485	2.379525	257	258	258
S	454	200	0.0049751243	40.438503	0.0049751243	2.7016406	452	452	452
T	179	200	0.0049751243	35.299545	0.2636816	0.75084305	172	172	172
U	627	200	0.0049751243	38.881256	0.0049751243	5.3322315	627	627	627
V	165	200	0.0049751243	40.369034	0.11442786	1.2370316	159	160	160
W	89	200	0.0049751243	34.50571	0.094527364	1.3114314	89	89	89
X	304	200	0.0049751243	34.702892	0.11442786	1.3594794	304	304	304
Y	260	200	0.0049751243	33.966785	0.0049751243	2.3594382	259	259	259
Z	187	200	0.0049751243	32.84846	0.014925373	2.4981256	181	182	183
AA	143	200	0.0049751243	28.68645	0.014925373	2.4380805	139	139	139
AB	359	200	0.0049751243	38.488644	0.5920398	-0.046441507	358	358	358
AC	136	200	0.0049751243	31.9772	0.19900498	0.90170306	134	134	134
AD	293	200	0.0049751243	35.52879	0.0099502485	2.445356	292	292	293
AE	438	200	0.0049751243	32.291782	0.0099502485	2.7945273	437	437	437
AF	321	200	0.0049751243	35.203712	0.04477612	1.8123521	321	321	321
AG	143	200	0.0049751243	37.025703	0.03482587	1.7594645	143	143	143
AH	201	200	0.0049751243	39.599396	0.0049751243	3.6155078	201	201	201
AI	198	200	0.0049751243	39.15733	0.2885572	0.5785125	198	198	198
AJ	122	200	0.0049751243	38.167957	0.11442786	1.1606476	120	120	120
AK	1025	200	0.0049751243	33.04211	0.03482587	2.1185117	1021	1023	1023
AL	488	200	0.0049751243	6.744684	0.0049751243	4.9473066	306	334	359
AM	1953	200	0.0049751243	7.2396345	0.0049751243	3.588506	637	696	769
AN	15425	200	0.05472637	1.7903067	0.0049751243	11.273278	915	1156	1536
AO	1614	200	0.0049751243	14.856123	0.0049751243	8.108011	1581	1586	1595
AP	4768	200	0.0049751243	20.928226	0.0049751243	11.643588	4613	4633	4644
AQ	66266	200	0.0049751243	32.22967	0.0049751243	14.024827	15426	16408	17676
AR	607	200	0.0049751243	8.526978	0.0049751243	3.1208818	257	289	311
AS	921	200	0.0049751243	11.970217	0.0049751243	5.740329	662	688	712`

Whats the reason behind getting all (except one) cell types an assoc_mcp equal to 0.004975? Also I read you recommend looking at the mcz scores...sorry for the naive question but is it the higher mcz score the better?

Thank you!

Cheers

Oct 30 '24 14:10 auesro

it means your observed test statistic is more extreme than all control statistics (for scDRS group-level analyses), indicating your cell type-disease associations are very significant. I note that 0.0049751243 = 1 / 201 I am a little surprised that all (but 1) cell types are associated. The software runs ok, but probably you want to check if you done a proper QC...

Oct 30 '24 14:10 martinjzhang

I feared something like that... Anything I could have missed beyond log-normalization, filtering to remove 0 count genes? I followed even your processing script so any suggestions more than welcome. I must add that all the celltypes include glia and neuron subtype cells (edited for clarification).

Oct 30 '24 14:10 auesro

It's common for scDRS to associate most neuronal cells to a brain disease like SCZ. Maybe looking at the different association strength across cell types will tell you something. Also, increase n_ctrl (from 200) to 1000 can improve calibration.

Oct 30 '24 14:10 martinjzhang

Thank you for the suggestion, I will run it right away with 1000 n_ctrl. How would you evaluate the "different association strength"?

Oct 30 '24 14:10 auesro

Higher norm disease scores --> stronger association. Maybe you can do a UMAP and look at which regions are more strongly associated?

Oct 30 '24 16:10 martinjzhang

Yea, sure, I did that. I can find some subtypes with clearly higher norm disease scores so I would delve into those. Also regarding the scores in the group-level analyses, mc z-scores are "better" than mc p values, right? Now, what is not clear to me is whether the higher the "better" or the lower the "better"? Thanks a lot for your help Martin!

Oct 30 '24 18:10 auesro

mc z-scores are "better" than mc p values in the sense that they can better distinguish disease association strength when the p-values are saturated.

higher z-score --> lower p-value.

p = 1 - CDF(z), roughly , where CDF is normal CDF.

Oct 30 '24 18:10 martinjzhang

Saturated p-values which is the case I find for most of my celltypes in the first table I posted, right?

Oct 30 '24 18:10 auesro

Yes, that your mc_p = 1 / (1+n_ctrl)

Oct 30 '24 18:10 martinjzhang

Hi @martinjzhang I am getting a similar result, with my lowest assoc_mcp being 0.000999001 for many of my cell type associations. From this thread I understand that first I should ensure that I have properly done QC to my scRNA seq data. I want to confirm that it was appropriate for me to use my raw single cell data in both the compute score and downstream analyses, as long as in both of these steps I set both --flag_filter True and --flag_raw_count True? Or should I have done additional preprocessing?

Alternatively may I have created way too large of a single cell dataset? I am trying to do a body wide analysis so I ended up with an AnnData object with n_obs × n_vars = 771060 × 79209. Maybe I should be trying to scale this back?

Nov 26 '24 19:11 NaomiHuntley

raw single cell data in both the compute score and downstream analyses

Yes, you are supposed to use the raw count data

If assoc_mcz are different, the results should be considered as normal.

Nov 26 '24 19:11 martinjzhang

AnnData object with n_obs × n_vars = 771060 × 79209

@NaomiHuntley how did you end up with 79209 genes? Way more than any "normal" scRNAseq study I have ever seen.

Nov 26 '24 19:11 auesro

scDRS scDRS copied to clipboard

Identical assoc_mcp in results

scDRS
scDRS copied to clipboard