scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

cluster average expression

Open wangjiawen2013 opened this issue 7 years ago • 10 comments

Dear, Can you add a function to calculate the average expression of each gene in each cluster ?

wangjiawen2013 avatar Jun 22 '18 03:06 wangjiawen2013

And, I want change the default colors of the tsne plot, but I don't know the relationship between the parameters 'palette' and 'color_map' and I have tried some settings, but they don't work, Can you help me ?

wangjiawen2013 avatar Jun 22 '18 07:06 wangjiawen2013

Hi! You can always change the colors of a categorical annotation by directly modifying, e.g. adata.uns['louvain_colors']. It's a bug that palettes doesn't change it when the colors field is present in adata.uns. I'll try to fix this today.

How do you want the mean for each cluster to be visualized or stored? If you run pl.paga(adata, color='mygene') the mean per cluster is plotted.

falexwolf avatar Jun 25 '18 09:06 falexwolf

Some times we need to know the average expression of each gene in each cluster, it's better to store it in a pandas dataframe with rows corresponding to each gene and columns corresponding to each cluster, then we can export it to a csv file.

wangjiawen2013 avatar Jun 26 '18 08:06 wangjiawen2013

I have written a function to do this... I could add it to scanpy. But for the moment, this is it:

def marker_gene_expression(anndata, marker_dict, gene_symbol_key=None, partition_key='louvain_r1'):
    """A function go get mean z-score expressions of marker genes
    # 
    # Inputs:
    #    anndata         - An AnnData object containing the data set and a partition
    #    marker_dict     - A dictionary with cell-type markers. The markers should be stores as anndata.var_names or 
    #                      an anndata.var field with the key given by the gene_symbol_key input
    #    gene_symbol_key - The key for the anndata.var field with gene IDs or names that correspond to the marker 
    #                      genes
    #    partition_key   - The key for the anndata.obs field where the cluster IDs are stored. The default is
    #                      'louvain_r1' """

    #Test inputs
    if partition_key not in anndata.obs.columns.values:
        print('KeyError: The partition key was not found in the passed AnnData object.')
        print('   Have you done the clustering? If so, please tell pass the cluster IDs with the AnnData object!')
        raise

    if (gene_symbol_key != None) and (gene_symbol_key not in anndata.var.columns.values):
        print('KeyError: The provided gene symbol key was not found in the passed AnnData object.')
        print('   Check that your cell type markers are given in a format that your anndata object knows!')
        raise
        

    if gene_symbol_key:
        gene_ids = anndata.var[gene_symbol_key]
    else:
        gene_ids = anndata.var_names

    clusters = anndata.obs[partition_key].cat.categories
    n_clust = len(clusters)
    marker_exp = pd.DataFrame(columns=clusters)
    marker_exp['cell_type'] = pd.Series({}, dtype='str')
    marker_names = []
    
    z_scores = sc.pp.scale(anndata, copy=True)

    i = 0
    for group in marker_dict:
        # Find the corresponding columns and get their mean expression in the cluster
        for gene in marker_dict[group]:
            ens_idx = np.in1d(gene_ids, gene) #Note there may be multiple mappings
            if np.sum(ens_idx) == 0:
                continue
            else:
                z_scores.obs[ens_idx[0]] = z_scores.X[:,ens_idx].mean(1) #works for both single and multiple mapping
                ens_idx = ens_idx[0]

            clust_marker_exp = z_scores.obs.groupby(partition_key)[ens_idx].apply(np.mean).tolist()
            clust_marker_exp.append(group)
            marker_exp.loc[i] = clust_marker_exp
            marker_names.append(gene)
            i+=1

    #Replace the rownames with informative gene symbols
    marker_exp.index = marker_names

    return(marker_exp)

You just need to add a python dictionary called 'marker_dict' where keys are strings and values are lists of strings. I think it's explained in the function itself as well though.

LuckyMD avatar Jun 26 '18 09:06 LuckyMD

@LuckyMD Have you added the function above to Scanpy yet ? If yes, what is it called ? Thanks, Aditi

aditisk avatar Feb 14 '19 20:02 aditisk

Hi, I haven't added it yet. But I have a new impetus to do so due to some reviewer's comments... so I will submit a pull request soon ;).

LuckyMD avatar Feb 15 '19 09:02 LuckyMD

When you add this, or if you have added it already, would you mention the new name on this thread? It's hard for me to know where to look in the docs. Thanks!

ekernf01 avatar Sep 24 '19 12:09 ekernf01

It's still not in a PR. I actually forgot about it as well atm... sorry. For now just copy the code above, please.

LuckyMD avatar Sep 24 '19 15:09 LuckyMD

As a general approach to this kind of problem, I write functions like this:



def grouped_obs_mean(adata, group_key, layer=None, gene_symbols=None):
    if layer is not None:
        getX = lambda x: x.layers[layer]
    else:
        getX = lambda x: x.X
    if gene_symbols is not None:
        new_idx = adata.var[idx]
    else:
        new_idx = adata.var_names

    grouped = adata.obs.groupby(group_key)
    out = pd.DataFrame(
        np.zeros((adata.shape[1], len(grouped)), dtype=np.float64),
        columns=list(grouped.groups.keys()),
        index=adata.var_names
    )

    for group, idx in grouped.indices.items():
        X = getX(adata[idx])
        out[group] = np.ravel(X.mean(axis=0, dtype=np.float64))
    return out

Swapping out the last 8 lines or so depending on what I'm calculating. To use a set of marker genes I'd call it as grouped_obs_mean(adata[:, marker_genes], ...).

At some point we might have groupby for AnnDatas, but that'll require figuring out how to be consistent about the returned type.

ivirshup avatar Sep 25 '19 06:09 ivirshup

As a general approach to this kind of problem, I write functions like this:

def grouped_obs_mean(adata, group_key, layer=None, gene_symbols=None):
    if layer is not None:
        getX = lambda x: x.layers[layer]
    else:
        getX = lambda x: x.X
    if gene_symbols is not None:
        new_idx = adata.var[idx]
    else:
        new_idx = adata.var_names

    grouped = adata.obs.groupby(group_key)
    out = pd.DataFrame(
        np.zeros((adata.shape[1], len(grouped)), dtype=np.float64),
        columns=list(grouped.groups.keys()),
        index=adata.var_names
    )

    for group, idx in grouped.indices.items():
        X = getX(adata[idx])
        out[group] = np.ravel(X.mean(axis=0, dtype=np.float64))
    return out

Swapping out the last 8 lines or so depending on what I'm calculating. To use a set of marker genes I'd call it as grouped_obs_mean(adata[:, marker_genes], ...).

At some point we might have groupby for AnnDatas, but that'll require figuring out how to be consistent about the returned type.

Thanks. But need to make a tiny amendment to make it work now:

out[group] = np.ravel(X.mean(axis=0, dtype=np.float64)).tolist()

yeswzc avatar Dec 29 '23 04:12 yeswzc