support for weighted sampled data to rank genes and plot (dot, violin…

Open Khalid-Usman opened this issue 5 years ago • 27 comments

I have modified the following files in scanpy version 1.4.

  1. _anndata.py (added support for weighted sampled data, where each row has its own non-zero weight; changes were made for the dot plot, violin plot and heatmap).

  2. _rank_genes_groups (to find marker genes for data where each row has a different non-zero weight, I have modified 't-test' and 'wilcoxon').

Suggestion:

For weighted sampled data one could modify PCA as well; I used MATLAB's pca function for weighted data.

Thanks Khalid

Khalid-Usman avatar May 14 '19 13:05 Khalid-Usman

There are a few items I’d like to see changed, and you should add tests.

I have applied your suggested changes, thanks. By tests, do you mean 'test cases' or some 'example' data to run the program?

Thanks ...

Khalid-Usman avatar May 16 '19 05:05 Khalid-Usman

Thank you! With “tests” I mean “functions named test_* with assert ... statements inside”

See here: https://github.com/theislab/scanpy/tree/master/scanpy/tests

flying-sheep avatar May 16 '19 11:05 flying-sheep

Thank you! With “tests” I mean “functions named test_* with assert ... statements inside”

Thanks for your guidance. I have added test_weightedSampling.py along with a folder named weighted_sampled in the _data folder.

I have updated scanpy to support weighted sampling in the downstream tasks (clustering, finding marker genes and plotting). I also suggest supporting it in earlier steps like PCA, for data where each observation has a weight (as in MATLAB).

Regards, Khalid

Khalid-Usman avatar May 17 '19 08:05 Khalid-Usman

I have resolved the issues the testing tool raised on the pull request; now the tool shows me duplications, which are mostly from pre-existing code and only 1-2 from my code. Please have a look into it. It's my first pull request and it's taking a lot of time :(

Thanks

Khalid-Usman avatar May 20 '19 04:05 Khalid-Usman

Hi, looks great!

Ignore the tool, I think it’s a bit broken. I need to figure out what’s wrong with it. The only duplicated code left is that _prepare_weighted_dataframe is very similar to _prepare_dataframe. I think you can delete _prepare_weighted_dataframe and just change _prepare_dataframe so it does return categories, obs_tidy, categorical. Then you can change each line like categories, obs_tidy = _prepare_dataframe(…) to categories, obs_tidy, _ = _prepare_dataframe(…)
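
A rough sketch of that refactor (the placeholder bodies stand in for the existing scanpy code, which differs in detail):

    def _prepare_dataframe(adata, var_names, groupby, use_raw=False, log=False,
                           num_categories=7, layer=None):
        obs_tidy = ...      # built from adata exactly as before (omitted here)
        categorical = ...   # per-observation group labels
        categories = ...    # unique category names
        # the extra return value makes _prepare_weighted_dataframe unnecessary
        return categories, obs_tidy, categorical

    # callers that don't need the categorical simply discard it:
    # categories, obs_tidy, _ = _prepare_dataframe(adata, var_names, groupby, ...)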

Other than that, there are only a few things left:

  1. The tests without plots should contain assertions. I.e. in test_genes_ranking() you should do assert np.all(adata.uns['wilcoxon']['names'][:5] == ['Gene1', 'Gene2', ...]) or so! (A sketch of such a test follows this list.)

  2. For the plot tests, you need to add these lines to the test file:

    https://github.com/theislab/scanpy/blob/d979267f48607fd609954c96cd5c586b6135dc30/scanpy/tests/test_plotting.py#L3-L4

    https://github.com/theislab/scanpy/blob/d979267f48607fd609954c96cd5c586b6135dc30/scanpy/tests/test_plotting.py#L8-L13

    And do each test like this (replace “xyz” with whatever you want):

    def test_xyz(image_comparer):
        save_and_compare_images = image_comparer(ROOT, FIGS, tol=15)
        […]
        sc.pl.xyz(adata, …)
        save_and_compare_images('xyz')
    

    This will make the tests save your plots to scanpy/tests/figures and compare them to the images in scanpy/test/_images. The tests will fail because scanpy/test/_images/xyz.png doesn’t exist. You need to copy the pngs from scanpy/tests/figures → scanpy/test/_images and git commit them.

  3. This needs to be fixed: https://github.com/theislab/scanpy/pull/644#discussion_r284652144

  4. I think the test data might be too large. @falexwolf do we have a recommended size for new test data?
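
To make point 1 concrete, here is a rough sketch of such a test (the data file, groupby key, weights column and expected gene names are placeholders, not the actual test data; weights is the argument added by this PR):

    import numpy as np
    import scanpy as sc

    def test_weighted_genes_ranking():
        adata = sc.read('weighted_sampled/data.h5ad')   # placeholder test file
        sc.tl.rank_genes_groups(
            adata, 'louvain', method='wilcoxon', key_added='wilcoxon',
            weights=adata.obs['weights'].values,        # placeholder weights column
        )
        # top 5 genes of the first group; expected names are placeholders
        top5 = [rec[0] for rec in adata.uns['wilcoxon']['names'][:5]]
        assert top5 == ['Gene1', 'Gene2', 'Gene3', 'Gene4', 'Gene5']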

@Khalid-Usman I’m sorry if you find that this takes long and is frustrating. If this is the case, just step away for a while and do something else! But I think you won’t regret doing this. You’re learning good coding practices here that will come in handy in the future, I promise!

Thank you for your contribution :tada:

flying-sheep avatar May 20 '19 09:05 flying-sheep

Hi, Can I ask what weighted sampling is? And what it is used for?

LuckyMD avatar May 20 '19 17:05 LuckyMD

I was using scanpy on a single-cell dataset, but instead of using all rows I have sampled the data with weights. I found that scanpy doesn't handle this, and since I was using gene ranking and some of the plots from scanpy, I modified the scanpy code to support weighted sampled data where each data point has a non-zero weight.

Khalid-Usman avatar May 20 '19 17:05 Khalid-Usman

I don't quite understand what sampled data with weights on the rows are. How do you weight individual cells in a dataset? What do weights like this mean?

LuckyMD avatar May 20 '19 18:05 LuckyMD

E.g. if your original input matrix has 1,000,000 cells and 100 genes, you don't want to process all rows, so you can perform either uniform or weighted sampling on the data. I performed weighted sampling and kept e.g. only 1,000 rows; each of those rows then carries a weight.
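
To sketch the idea with toy numbers (this is not the coreset code; the inverse-probability weights below are just one common convention for keeping weighted statistics unbiased):

    import numpy as np

    rng = np.random.default_rng(0)
    n_cells, n_sample = 1_000_000, 1_000
    p = rng.random(n_cells)
    p /= p.sum()                                    # per-cell sampling probabilities
    idx = rng.choice(n_cells, size=n_sample, replace=False, p=p)
    weights = 1.0 / (n_sample * p[idx])             # weight carried by each sampled row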

Khalid-Usman avatar May 20 '19 18:05 Khalid-Usman

Hi Philipp,

I have updated accordingly. Again there are no issues except the duplications, and I analysed them: most come from pre-existing code.

Regards, Khalid

Khalid-Usman avatar May 20 '19 18:05 Khalid-Usman

So you want to e.g., downweight the likelihood of sampling cells with a particular feature (like a common cell type), and upweight others. What do you want to use this weighting for now in the sc.tl.rank_genes_groups function? Or in the visualization functions you changed?

LuckyMD avatar May 20 '19 19:05 LuckyMD

  1. No, I have sampled cells with weights. Out of those 1,000 rows most have weight = 1, but e.g. one row has weight 125; in gene ranking the expression of all genes is then multiplied by that cell's specific weight, so I updated the code to calculate a weighted mean and variance (a toy sketch of this follows after this list). Before updating this I was getting wrong marker genes. The same applies to the plotted points in dotplot, stacked_violin and heatmap.

  2. I suggest that scanpy should support weighted data more broadly; I mean PCA should also be computed for data with weighted observations (PCA in MATLAB supports observation weights). Currently my input is already weighted PCA data for clustering, so I don't need to update the PCA code, but in the future it would be good for scanpy to support weighted sampled data there as well.
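
A minimal sketch of the weighted mean/variance idea from point 1, on toy data (not the PR's actual code):

    import numpy as np

    X = np.random.rand(1000, 100)   # sampled cells x genes
    w = np.ones(1000)
    w[0] = 125.0                    # most cells have weight 1, one has weight 125

    w_sum = w.sum()
    mean_w = (w[:, None] * X).sum(axis=0) / w_sum                  # weighted mean per gene
    var_w = (w[:, None] * (X - mean_w) ** 2).sum(axis=0) / w_sum   # weighted variance per gene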

Thanks, Khalid

Khalid-Usman avatar May 20 '19 19:05 Khalid-Usman

I'm not sure I entirely understand what the weights are based on. I'm trying to understand when you would suggest someone use your approach. Why do you give one cell a weight of 125? With this type of weight distribution you are basically manually changing the marker gene calculation to focus almost entirely on a single cell. That seems strange to me.

I'm trying to understand the need for scanpy to support weighted observations. At the moment I don't see when you would want to differently weight the observations... I'm familiar with using weights if I have some form of measurement error or uncertainty between samples. I don't really see how that holds here. Do you weight the cells based on some kind of quality score?

LuckyMD avatar May 20 '19 19:05 LuckyMD

I am using a sampling technique which samples only a few rows without decreasing performance, so it is more than 10x faster on larger datasets with similar accuracy.

Khalid-Usman avatar May 20 '19 19:05 Khalid-Usman

I understand the benefits of sampling regarding computational speed up. What I'm not clear on is how you choose your weights for the calculations you perform here. You mentioned that you get wrong marker gene results when you sample and don't use weights. That makes sense if you get a non-representative set of cells in your sample. I wonder how you select the weights to fix this. I guess you don't just try a lot of different values until one works, right?

LuckyMD avatar May 21 '19 09:05 LuckyMD

Yes, the sampling is done with weights, and I used the coreset technique for it.

Khalid-Usman avatar May 21 '19 10:05 Khalid-Usman

Ah, okay... so you sample based on how representative a cell is of its neighbours, and then you use that weight to calculate PCA and marker genes, and in the visualizations. Is that correct?

LuckyMD avatar May 21 '19 10:05 LuckyMD

This is how I understood it anyway. You circumvent the problems caused by random sampling by doing a biased sampling, and the weights correspond to the size of the represented community.

flying-sheep avatar May 21 '19 11:05 flying-sheep

This is how I understood it anyway. You circumvent the problems caused by random sampling by doing a biased sampling, and the weights correspond to the size of the represented community.

If the weights are just cluster-size-related, then this would not be necessary as you are calculating cluster-specific values to compare them in e.g., dot plots and statistical tests. This is a weighting that affects the relative importance of cells in the same cluster.

LuckyMD avatar May 21 '19 11:05 LuckyMD

Not clusters, I meant that cells are selected to represent a group of cells, and the weights are to specify how big that group is. E.g. if you have 5 cells that are very similar, you pick one and give it a weight that says “I represent 5 cells in the original data”. Right?
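
A tiny numeric example of that picture (made-up values):

    import numpy as np

    full = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 9.0])   # one gene, six original cells
    reduced = np.array([1.0, 9.0])                     # one representative per group
    weights = np.array([5.0, 1.0])                     # "I represent 5 cells" / "1 cell"
    assert np.isclose(full.mean(), np.average(reduced, weights=weights))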

flying-sheep avatar May 21 '19 12:05 flying-sheep

Exactly, it's just as Philipp explained.

Regards, Khalid

Khalid-Usman avatar May 21 '19 14:05 Khalid-Usman

So, what's next? :)

I think there are no more issues in the code, only a few duplications remain.

Thanks

Khalid-Usman avatar May 21 '19 18:05 Khalid-Usman

OK, this looks really good now!

The two remaining issues are:

  1. https://github.com/theislab/scanpy/pull/644#discussion_r284651181. I think you should not do df_weights = weights.copy(deep=True), but something like

    df_weights = pd.DataFrame(dict(Wt=weights))
    

    You need to change your code so you call the functions with an np.ndarray or so instead of your 1-column dataframe, but I think it’s the better API (see the sketch after this list).

  2. You should add the weights argument at the end of rank_genes_groups’ parameters, not the middle.
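
A small sketch of that call pattern (Wt is the column name used in this PR; the numbers are illustrative):

    import numpy as np
    import pandas as pd

    weights = np.array([1.0, 1.0, 125.0, 1.0])   # what callers pass in
    df_weights = pd.DataFrame(dict(Wt=weights))  # built internally only where needed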

flying-sheep avatar May 22 '19 09:05 flying-sheep

Long-term, we should think about the design here: Passing weights in every function call is possible, but not very nice for users. So a few questions come to mind:

Should we add scanpy.pp.coreset, which would create a sampling and add adata.obs['coreset_weights'] or simply adata.obs['weights']?

If we do that or plan to in the future, how should the added weights parameter to all these functions work?

I think it might default to 'coreset_weights'/'weights', and the functions would automatically use that .obs column if it exists. Users should also still be able to specify weights manually as in this PR.

So the type of the parameter would be Union[str, Sequence[Union[float, int]]].
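
For example, a hypothetical resolution helper along those lines (not existing scanpy API, just a sketch of the idea):

    from typing import Optional, Sequence, Union

    import numpy as np

    def _resolve_weights(adata, weights: Optional[Union[str, Sequence[float]]] = 'weights'):
        """Return an array of weights, or None to fall back to unweighted behaviour."""
        if weights is None:
            return None
        if isinstance(weights, str):
            if weights not in adata.obs:   # no stored weights -> unweighted
                return None
            return adata.obs[weights].to_numpy()
        return np.asarray(weights, dtype=float)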


All of that doesn’t really affect this PR, as we can merge it as it is and include anndata-stored weights later.

flying-sheep avatar May 22 '19 09:05 flying-sheep

Thanks,

But I would suggest supporting weights in general rather than coresets specifically; maybe the user wants to sample the data with some other weighting technique. So we should just ask them to provide weights for the observations. Then PCA would need to be modified as well. I think my code supports most of the plots and marker genes, but not PCA, because my input is already a PCA matrix with a weight for each observation.

Thanks, Khalid

Khalid-Usman avatar May 22 '19 09:05 Khalid-Usman

Thanks Philipp, I have updated and pushed the code. I hope you will accept this pull request now. To support PCA and the rest of scanpy for weighted sampling, you could just add a parameter for observation/sample weights when the user provides the input matrix, and then modify PCA; the remaining code here is fine. I am asking for plain weights because the user may have extracted them with a sampling technique, or may sometimes want to supply their own weights, e.g. to focus on one cell type. So we should support weights generally rather than specifically.

Thanks, Khalid

Khalid-Usman avatar May 22 '19 15:05 Khalid-Usman

Yeah, mostly. There’s still a bit to do, some of which I did (see the commits).

The things left are that

  1. all functions should only accept sequences (lists, ndarray, …) as weights, not dataframes.

  2. there are multiple blocks that all look like this and can be replaced by a helper function:

    categories, obs_tidy, catego = _prepare_dataframe(
        adata, var_names, groupby, use_raw, log, num_categories, layer=layer,
    )
    if weights is not None:
        mean_obs = _compute_gw_Avg_of_dataframe(obs_tidy, weights, catego, groupby)
        mean_obs = mean_obs.drop('Wt', axis=1)
    else:
        mean_obs = obs_tidy.groupby(level=0).mean()
    
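For instance, a hypothetical helper wrapping exactly that block (using the names from the snippet above):

    def _group_mean(obs_tidy, weights, catego, groupby):
        if weights is not None:
            mean_obs = _compute_gw_Avg_of_dataframe(obs_tidy, weights, catego, groupby)
            return mean_obs.drop('Wt', axis=1)
        return obs_tidy.groupby(level=0).mean()

    # each call site then shrinks to:
    # mean_obs = _group_mean(obs_tidy, weights, catego, groupby)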

flying-sheep avatar May 22 '19 20:05 flying-sheep