ehrapy
ehrapy copied to clipboard
improve move_to_x
PR Checklist
- [ ] This comment contains a description of changes (with reason)
- [ ] Referenced issue is linked
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] Documentation in
docs
is updated
Description of changes
Add three arguments to move_to_x: copy_uns
, copy_obsm
, and copy_varm
and they are all set to be True and optional.
Technical details Tests are already been added.
Additional context
A few points on this:
- I wonder if it is a good idea to store computed results (e.g.
adata.obsm["X_pca"]
) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that? - If
.obsm
is copied,.obsp
might be as well following this logic I think? - Copying
.varm
should be omitted in any case my opinion; This contains per-feature computed results. By doingmove_to_x
, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)
import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
:,
(adata.var.index == "time_in_hospital_days")
| (adata.var.index == "num_lab_procedures")
| (adata.var.index == "num_procedures")
| (adata.var.index == "num_medications")
| (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).
Considering that move_to_x
creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?
@xinyuejohn could you also outline for us again for what exactly you need this feature, please? I do totally see @eroell points
A few points on this:
- I wonder if it is a good idea to store computed results (e.g.
adata.obsm["X_pca"]
) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that?- If
.obsm
is copied,.obsp
might be as well following this logic I think?- Copying
.varm
should be omitted in any case my opinion; This contains per-feature computed results. By doingmove_to_x
, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)import ehrapy as ep adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"]) adata_prep = adata[ :, (adata.var.index == "time_in_hospital_days") | (adata.var.index == "num_lab_procedures") | (adata.var.index == "num_procedures") | (adata.var.index == "num_medications") | (adata.var.index == "number_diagnoses"), ] ep.pp.pca(adata_prep, n_comps=2) adata_prep.varm["PCs"] adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).
Considering that
move_to_x
creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?
Thanks for your reply and I totally got your points!
Here's my use case:
adata.obs = pd.merge(adata.obs, df_statistics, how="left", left_index=True, right_index=True)
adata = ep.ad.move_to_x(adata, list(df_statistics.columns))
adata.obsm = obsm
In .obsm, there are some awkward arrays which still valid as long as the length of observations keep same.
Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs
field, and then move these to the .X
field.
Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the .obm
field that is still valid?
Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the
.obs
field, and then move these to the.X
field.Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the
.obm
field that is still valid?
I think this diagram will help you understand what I want to do. Please kindly have a look at it.
In adata, each row represents a patient's visit.
In .obs
, there are all the episode level features. And they have only a single value and without any time information, like height/age/gender.
In .obsm
, there are all time series features stored in awkward array. They have multiple records and have both time information and values. For example, adata.obsm['heart rate'][0] = awkward_array(time: [0, 1, 2, 4, 5, 7], value: [99, 80, 78, 91, 70, 90])
In this settings, I can easily move data from .obs
to .X using move_to_x
we are talking about and I can also move .obsm
to .X
using some aggregations. And if I change .X
, .obsm
will still be valid as it stores data irrelevant to .X
Thanks for the diagram, I think I get the point here - it is data in the .obsm in your case, not a computed result. The .obsm field in the scRNA-seq setting is often used for computed results, e.g. most things that produce an embedding of some sort. In ehrapy, also many tools compute an embedding and store that in the .obsm field.
It probably makes sense to not mix things computed from .X with raw data together in .obsm? Although expert users like you @xinyuejohn ofc can use this structure if helpful ;)
We have yet to include time series support in a meaningful way - a data field alike your use of .obsm is one of the candidates for sure.
Whatever this might look like, copying all the -data- as you suggest here should happen indeed.
I think I would refrain from copying .obsm when doing move_to_x
, as in many cases this field will be of results computed from .X
What do you think @xinyuejohn ?
I also think that it's weird if it happens automatically, but maybe we can just parameterize this and well document the use-case?
That would be .obsm
, .uns
, .obsp
you'd consider taking in?
Think so ya? Or just obms
and uns
?!?
So what do we do with this PR?
I'd be in favor of not having these values copied now, but storing this data later more consistently.
But I'm also fine having options for copying .obsm, .uns, .obsp (while setting the default to False)
Okay, then let's make these options if it makes it easier for ehrdata. That can also depend on an evaluation of @eroell
I'd close this PR for now, as I think the underlying outline of data handling patterns will not be the way to go
- the .obsm field is likely not a suitable place to store data of time-series variables, as it lacks the .var field for many operations
- moving .obsm to .X fails when .obsm is 3D arrays (.X must be 2D), which is what .obsm would often be if time series were to be stored here