scanpy
support for weighted sampled data to rank genes and plot (dot, violin…
I have modified the following files in scanpy version 1.4:
- _anndata.py (added support for weighted sampled data, where each row has its own non-zero weight; changes made for dot plot, violin plot and heatmap).
- _rank_genes_groups (to find marker genes for data where each row has a different non-zero weight, I have modified 't-test' and 'wilcoxon').
Suggestion: for weighted sampled data one could modify PCA as well; I used MATLAB's PCA for weighted data.
Thanks Khalid
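The weighted PCA mentioned here could be sketched in NumPy roughly as follows. This is an illustrative sketch only, not part of this PR or of scanpy; it mirrors the idea of MATLAB's pca(..., 'Weights', w), where each observation contributes to the mean and covariance in proportion to its weight:

```python
import numpy as np

def weighted_pca(X, w, n_comps=2):
    """Sketch of PCA with per-observation weights.

    X: (n_obs, n_vars) data matrix, w: (n_obs,) positive weights.
    Illustrative only; not the MATLAB implementation or scanpy code.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    mu = np.average(X, axis=0, weights=w)      # weighted mean per variable
    Xc = X - mu                                # center on the weighted mean
    cov = (Xc * w[:, None]).T @ Xc / w.sum()   # weighted covariance matrix
    evals, evecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = evecs[:, ::-1][:, :n_comps]          # top n_comps components
    return Xc @ top                            # projections, (n_obs, n_comps)
```

With uniform weights this reduces to ordinary (population-covariance) PCA, so a cell with weight 5 simply acts like 5 identical cells.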
There are a few items I’d like to see changed, and you should add tests.
I have made the changes you suggested, thanks. By tests do you mean 'test cases' or some 'example' data to run the program?
Thanks ...
Thank you! With “tests” I mean “functions named test_* with assert ... statements inside”.
See here: https://github.com/theislab/scanpy/tree/master/scanpy/tests
Thanks for your guidance, I have added test_weightedSampling.py with a folder named weighted_sampled in the _data folder.
I have updated scanpy to use weighted sampling in the later tasks (clustering, finding marker genes and plotting). I also suggest supporting it for initial tasks like PCA, for data where each observation has a weight (as in MATLAB).
Regards, Khalid
I have resolved the issues the testing tool raised on the pull request; now the tool shows me duplications, which are mostly from other code and only 1-2 from mine. Please have a look into it. It's my first pull request and it's taking a lot of time :(
Thanks
Hi, looks great!
Ignore the tool, I think it’s a bit broken. I need to figure out what’s wrong with it. The only duplicated code left is that _prepare_weighted_dataframe is very similar to _prepare_dataframe. I think you can delete _prepare_weighted_dataframe and just change _prepare_dataframe so it does return categories, obs_tidy, categorical. Then you can change each line like categories, obs_tidy = _prepare_dataframe(…) to categories, obs_tidy, _ = _prepare_dataframe(…)
Other than that, there are only a few things left:
- The tests without plots should contain assertions. I.e. in test_genes_ranking() you should do assert np.all(adata.uns['wilcoxon']['names'][:5] == ['Gene1', 'Gene2', ...]) or so!
- For the plot tests, you need to add these lines to the test file:
https://github.com/theislab/scanpy/blob/d979267f48607fd609954c96cd5c586b6135dc30/scanpy/tests/test_plotting.py#L3-L4
https://github.com/theislab/scanpy/blob/d979267f48607fd609954c96cd5c586b6135dc30/scanpy/tests/test_plotting.py#L8-L13
And do each test like this (replace “xyz” with whatever you want):

    def test_xyz(image_comparer):
        save_and_compare_images = image_comparer(ROOT, FIGS, tol=15)
        […]
        sc.pl.xyz(adata, …)
        save_and_compare_images('xyz')

This will make the tests save your plots to scanpy/tests/figures and compare them to the images in scanpy/tests/_images. The tests will fail because scanpy/tests/_images/xyz.png doesn’t exist. You need to copy the pngs from scanpy/tests/figures → scanpy/tests/_images and git commit them.
- This needs to be fixed: https://github.com/theislab/scanpy/pull/644#discussion_r284652144
- I think the test data might be too large. @falexwolf do we have a recommended size for new test data?
@Khalid-Usman I’m sorry if you find that this takes long and is frustrating. If this is the case, just step away for a while and do something else! But I think you won’t regret doing this. You’re learning good coding practices here that will come in handy in the future, I promise!
Thank you for your contribution 🎉
Hi, can I ask what weighted sampling is? And what is it used for?
I was using scanpy for a single-cell dataset, but I have sampled the data with weights instead of using all rows. I found that scanpy doesn't handle this, and since I was using gene ranking and some plots from scanpy, I modified the scanpy code to support weighted sampled data where each data point has some non-zero weight.
I don't quite understand what sampled data with weights on the rows are. How do you weight individual cells in a dataset? What do weights like this mean?
E.g. if your original input matrix has 1,000,000 cells and 100 genes, you don't want to process all rows, so you can perform either uniform or weighted sampling on the data. I performed weighted sampling and kept e.g. only 1,000 rows; each row then has a weight.
Hi Philipp,
I have updated accordingly; again there are no issues except the duplications, and I analysed that most of them are from pre-existing code.
Regards, Khalid
So you want to e.g., downweight the likelihood of sampling cells with a particular feature (like a common cell type), and upweight others. What do you want to use this weighting for now in the sc.tl.rank_genes_groups function? Or in the visualization functions you changed?
- No, I have sampled cells with weights. Out of those 1,000 rows most have weight = 1, but e.g. one row has weight 125; in gene ranking the expression of all genes is then multiplied by that specific cell weight, so I updated the code to calculate a weighted mean and variance. Before this change I was getting wrong marker genes. The same applies to plotting points in dotplot, stacked_violin and heatmap.
- I suggest that scanpy should support weighted data generally; I mean PCA should also be computed for data with weighted observations (PCA in MATLAB supports weighted observations). Currently my input is already weighted-PCA data for clustering, so I don't need to update the PCA code, but in the future it would be good for scanpy to support weighted sampled data there as well.
Thanks, Khalid
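To make the weighted mean and variance idea concrete, here is a sketch of the computation (illustrative only, not the exact code in the PR). A cell with weight w counts like w identical cells in the original data:

```python
import numpy as np

def weighted_mean_var(x, w):
    """Weighted mean and (population) variance of expression values x
    with per-cell weights w. Sketch of the idea, not the PR's code."""
    w = np.asarray(w, dtype=float)
    mean = np.average(x, axis=0, weights=w)
    var = np.average((x - mean) ** 2, axis=0, weights=w)
    return mean, var

# a cell with weight 5 behaves like 5 copies of that cell:
x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 5.0])
m, v = weighted_mean_var(x, w)
# same mean/variance as the unweighted values [1, 2, 3, 3, 3, 3, 3]
```

With statistics like these, a t-test on the sampled data can use the weighted mean, weighted variance, and the effective sample size (the sum of the weights) instead of the plain group statistics.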
I'm not sure I entirely understand what the weights are based on. I'm trying to understand when you would suggest someone use your approach. Why do you give one cell a weight of 125? With this type of weight distribution you are basically manually changing the marker gene calculation focusing nearly only on a single cell. That seems strange to me.
I'm trying to understand the need for scanpy to support weighted observations. At the moment I don't see when you would want to differently weight the observations... I'm familiar with using weights if I have some form of measurement error or uncertainty between samples. I don't really see how that holds here. Do you weight the cells based on some kind of quality score?
I am using a sampling technique which samples only a few rows without decreasing performance. So it is more than 10X faster for larger datasets, with similar accuracy.
I understand the benefits of sampling regarding computational speed up. What I'm not clear on is how you choose your weights for the calculations you perform here. You mentioned that you get wrong marker gene results when you sample and don't use weights. That makes sense if you get a non-representative set of cells in your sample. I wonder how you select the weights to fix this. I guess you don't just try a lot of different values until one works, right?
Yes, the sampling is done with weights; I used the coreset technique for it.
Ah, okay... so you sample based on how representative a cell is of its neighbours, and then you use that weight to calculate PCA, marker genes, and perform visualizations. Is that correct?
This is how I understood it anyway. You circumvent the problems caused by random sampling by doing a biased sampling, and the weights correspond to the size of the represented community.
If the weights are just cluster-size-related, then this would not be necessary as you are calculating cluster-specific values to compare them in e.g., dot plots and statistical tests. This is a weighting that affects the relative importance of cells in the same cluster.
Not clusters, I meant that cells are selected to represent a group of cells, and the weights are to specify how big that group is. E.g. if you have 5 cells that are very similar, you pick one and give it a weight that says “I represent 5 cells in the original data”. Right?
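In code, this scheme looks roughly like the following toy example (one gene, made-up numbers; not the actual coreset construction):

```python
import numpy as np

# toy data: one gene, 7 cells; the first 5 are nearly identical
cells = np.array([3.0, 3.1, 2.9, 3.0, 3.0, 10.0, 0.5])

# keep one representative for the similar group plus the two distinct cells
sample = np.array([3.0, 10.0, 0.5])
weights = np.array([5.0, 1.0, 1.0])   # "this cell represents 5 original cells"

full_mean = cells.mean()
approx_mean = np.average(sample, weights=weights)
# the weighted mean over 3 sampled cells approximates the mean over all 7
```

An unweighted mean over the 3 sampled cells would be badly biased toward the two rare cells; the weights restore the contribution of the represented group.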
Exactly, it's just as Philipp explained.
Regards, Khalid
So, what's next? :)
I think there are no more issues in the code, just a few duplications.
Thanks
OK, this looks really good now!
The two remaining issues are:
- https://github.com/theislab/scanpy/pull/644#discussion_r284651181. I think you should not do df_weights = weights.copy(deep=True), but something like df_weights = pd.DataFrame(dict(Wt=weights)). You need to change your code so you call the functions with a np.ndarray or so instead of your 1-column dataframe, but I think it’s the better API.
- You should add the weights argument at the end of rank_genes_groups’ parameters, not in the middle.
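A possible shape for that change, i.e. accepting a plain sequence and building the internal 'Wt' frame instead of deep-copying a user-supplied dataframe (hypothetical helper name, just illustrating the suggestion):

```python
import numpy as np
import pandas as pd

def as_weight_frame(weights):
    """Build the internal 1-column 'Wt' frame from a plain weight sequence.

    Hypothetical helper: callers pass a list or np.ndarray, and the
    dataframe is constructed here rather than copied from the user.
    """
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1:
        raise ValueError('weights must be one-dimensional')
    if not (w > 0).all():
        raise ValueError('weights must be positive (non-zero)')
    return pd.DataFrame(dict(Wt=w))
```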
Long-term, we should think about the design here: Passing weights in every function call is possible, but not very nice for users. So a few questions come to mind:
- Should we add scanpy.pp.coreset, which would create a sampling and add adata.obs['coreset_weights'] or simply adata.obs['weights']?
- If we do that or plan to in the future, how should the added weights parameter to all these functions work? I think it might default to 'coreset_weights'/'weights', and the functions would automatically use that .obs column if it exists. Users should also still be able to specify weights manually as in this PR. So the type of the parameter would be Union[str, Sequence[Union[float, int]]].
All of that doesn’t really affect this PR, as we can merge it as it is and include anndata-stored weights later.
Thanks, but I would suggest supporting general weights instead of a coreset; maybe the user wants to sample data with some other weighting technique. So we should just ask them to provide weights for the observations. Then we would need to modify PCA as well; I think my code supports most plots and marker genes, but not PCA, because my input is a PCA matrix with weights for each observation.
Thanks, Khalid
Thanks Philipp, I have updated and pushed the code. I hope you will accept this pull request now. To support PCA and the rest of scanpy for weighted sampling, you could add an observation/sample weights parameter when the user inputs the matrix and then modify PCA; the rest of this code is fine. I am asking for general weights because the user may have extracted them with a sampling technique, or may sometimes want to give their own weights, e.g. to focus on one cell type. So we should support weights generally rather than specifically.
Thanks, Khalid
Yeah, mostly. There’s still a bit to do, some of which I did (see the commits).
The things left are:
- all functions should only accept sequences (lists, ndarray, …) as weights, not dataframes.
- there are multiple blocks that all look like this and can be replaced by a helper function:

    categories, obs_tidy, catego = _prepare_dataframe(
        adata, var_names, groupby, use_raw, log, num_categories,
        layer=layer,
    )
    if weights is not None:
        mean_obs = _compute_gw_Avg_of_dataframe(obs_tidy, weights, catego, groupby)
        mean_obs = mean_obs.drop('Wt', axis=1)
    else:
        mean_obs = obs_tidy.groupby(level=0).mean()
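The weighted branch of that repeated block computes a per-group weighted mean. A self-contained sketch of that computation in plain pandas (illustrative; not the PR's actual _compute_gw_Avg_of_dataframe, and the column names are assumptions):

```python
import numpy as np
import pandas as pd

def weighted_group_mean(obs_tidy, weights):
    """Weighted per-group mean of a tidy expression frame.

    obs_tidy is indexed by group category (as _prepare_dataframe returns
    it) and weights holds one weight per row. Sketch of what the repeated
    block computes; not the PR's actual helper.
    """
    w = np.asarray(weights, dtype=float)
    weighted_sums = obs_tidy.mul(w, axis=0).groupby(level=0).sum()
    weight_sums = pd.Series(w, index=obs_tidy.index).groupby(level=0).sum()
    return weighted_sums.div(weight_sums, axis=0)
```

A helper like this lets every plotting function reduce the duplicated block to a single call that takes obs_tidy and the optional weights.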