scanpy Subsample by observations grouping

[X] Additional function parameters / changed functionality / changed defaults?
[ ] New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
[ ] New plotting function: A kind of plot you would like to seein sc.pl?
[ ] External tools: Do you know an existing package that should go into sc.external.*?
[ ] Other?

Related to scanpy.pp.subsample, it would be useful to have a subsampling tool that subsamples based on the key of an observations grouping. E.g., if I have an observation key 'MyGroup' with possible values ['A', 'B'], and there are 10,000 cells of type 'A' and 2,000 cells of type 'B' and I want only max 5,000 cells of each type, then this function would subsample 5,000 cells of type 'A' but retain all 2,000 cells of type 'B'.

Jan 12 '20 23:01 chsher

Something like this should work. Note, this is not tested.

target_cells = 5000

adatas = [adata[adata.obs[cluster_key].isin(clust)] for clust in adata.obs[cluster_key].cat.categories]

for dat in adatas:
    if dat.n_obs > target_cells:
         sc.pp.subsample(dat, n_obs=target_cells)

adata_downsampled = adatas[0].concatenate(*adatas[1:])

Hope that helps.

Jan 14 '20 08:01 LuckyMD

Thank you @LuckyMD, it worked!

Dec 03 '20 20:12 chsher

I'll reopen this cause I think it's quite relevant still and could be very straightforward to implement with sklearn resample

also, there is an entire package for subsampling strategies which is probably quite relevant: https://github.com/scikit-learn-contrib/imbalanced-learn

line here for reference: https://github.com/theislab/scanpy/blob/48cc7b38f1f31a78902a892041902cc810ddfcd3/scanpy/preprocessing/_simple.py#L857

Nov 18 '21 14:11 giovp

back here reminding myself that this would be very useful feature to have...

Feb 17 '22 09:02 giovp

@bio-la also expressed some interest here on MM

@giovp, did you have a particular strategy in mind for resampling?

Feb 17 '22 15:02 ivirshup

So assuming that we are only interested in downsampling, then I'd say NearMiss and related are straightforward and scalable (just need to compute a kmeans whcih is really fast)

Feb 17 '22 16:02 giovp

also, the fact that reshuflling is performed is not in docs and should be documented. @bio-la do you plan to work on this?

Feb 28 '22 12:02 giovp

then I'd say NearMiss and related are straightforward and scalable (just need to compute a kmeans whcih is really fast)

For sampling from datasets, I would want to go with either extremely straightforward or something that has been shown to work. Maybe we could start with use provided labels to downsample by?

reshuflling is performed

Reshuffling meaning that the order is changed?

Feb 28 '22 13:02 ivirshup

Linking some previous discussion:

https://github.com/theislab/scanpy/pull/943
https://github.com/theislab/scanpy/pull/1382

Feb 28 '22 13:02 ivirshup

clust

in scanpy1.8 , this works

`target_cells = 3000

adatas = [adata_train[adata_train.obs[cluster_key].isin([clust])] for clust in adata_train.obs[cluster_key].cat.categories]

for dat in adatas: if dat.n_obs > target_cells: sc.pp.subsample(dat, n_obs=target_cells, random_state=0)

adata_train_downsampled1 = adatas[0].concatenate(*adatas[1:])`

Apr 08 '22 12:04 chansigit

This function at least subsamples all classes in an obs column to the same number of cells. Would be straightforward to modify to what you probably think of.

def obs_key_wise_subsampling(adata, obs_key, N):
    '''
    Subsample each class to same cell numbers (N). Classes are given by obs_key pointing to categorical in adata.obs.
    '''
    counts = adata.obs[obs_key].value_counts()
    # subsample indices per group defined by obs_key
    indices = [np.random.choice(adata.obs_names[adata.obs[obs_key]==group], size=N, replace=False) for group in counts.index]
    selection = np.hstack(np.array(indices))
    return adata[selection].copy()

Jan 18 '23 14:01 stefanpeidli

@stefanpeidli's code gives this error

ValueError: Cannot take a larger sample than population when 'replace=False'

If a group has less than required number observations, it shouldn't subsample.

target_cells = 1000
cluster_key = "cell_type"

grouped = adata.obs.groupby(cluster_key)
downsampled_indices = []

for _, group in grouped:
    if len(group) > target_cells:
        downsampled_indices.extend(group.sample(target_cells).index)
    else:
        downsampled_indices.extend(group.index)

adata_downsampled = adata[downsampled_indices]

Jan 19 '23 14:01 royfrancis

scanpy scanpy copied to clipboard

Subsample by observations grouping

scanpy
scanpy copied to clipboard