ehrapy icon indicating copy to clipboard operation
ehrapy copied to clipboard

improve move_to_x

Open xinyuejohn opened this issue 11 months ago • 12 comments

PR Checklist

  • [ ] This comment contains a description of changes (with reason)
  • [ ] Referenced issue is linked
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] Documentation in docs is updated

Description of changes Add three arguments to move_to_x: copy_uns, copy_obsm, and copy_varm and they are all set to be True and optional.

Technical details Tests are already been added.

Additional context

xinyuejohn avatar Mar 04 '24 08:03 xinyuejohn

A few points on this:

  • I wonder if it is a good idea to store computed results (e.g. adata.obsm["X_pca"]) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that?
  • If .obsm is copied, .obsp might be as well following this logic I think?
  • Copying .varm should be omitted in any case my opinion; This contains per-feature computed results. By doing move_to_x, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)
import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
    :,
    (adata.var.index == "time_in_hospital_days")
    | (adata.var.index == "num_lab_procedures")
    | (adata.var.index == "num_procedures")
    | (adata.var.index == "num_medications")
    | (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).

Considering that move_to_x creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?

eroell avatar Mar 05 '24 13:03 eroell

@xinyuejohn could you also outline for us again for what exactly you need this feature, please? I do totally see @eroell points

Zethson avatar Mar 05 '24 13:03 Zethson

A few points on this:

  • I wonder if it is a good idea to store computed results (e.g. adata.obsm["X_pca"]) together with altered data that would actually yield different results if one would recompute. Is there a striking example justifying to do that?
  • If .obsm is copied, .obsp might be as well following this logic I think?
  • Copying .varm should be omitted in any case my opinion; This contains per-feature computed results. By doing move_to_x, a new feature is actually added. In addition to the thoughts about changed data above, here the dimensionality of the data does not match anymore. Example below of how this pretty quickly fails :)
import ehrapy as ep
adata = ep.dt.diabetes_130(columns_obs_only=["race", "gender", "age"])
adata_prep = adata[
    :,
    (adata.var.index == "time_in_hospital_days")
    | (adata.var.index == "num_lab_procedures")
    | (adata.var.index == "num_procedures")
    | (adata.var.index == "num_medications")
    | (adata.var.index == "number_diagnoses"),
]
ep.pp.pca(adata_prep, n_comps=2)
adata_prep.varm["PCs"]
adata_prep_2 = ep.ad.move_to_x(adata_prep, "gender", copy_varm=True)
ValueError: Value passed for key 'PCs' is of incorrect shape. Values of varm must match dimensions ('var',) of parent. Value had shape (5,) while it should have had (6,).

Considering that move_to_x creates somewhat a new dataset, it might be justified and/or necessary to (re)compute things?

Thanks for your reply and I totally got your points!

Here's my use case:

adata.obs = pd.merge(adata.obs, df_statistics, how="left", left_index=True, right_index=True)
adata = ep.ad.move_to_x(adata, list(df_statistics.columns))
adata.obsm = obsm

In .obsm, there are some awkward arrays which still valid as long as the length of observations keep same.

xinyuejohn avatar Mar 07 '24 10:03 xinyuejohn

Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs field, and then move these to the .X field.

Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the .obm field that is still valid?

eroell avatar Mar 08 '24 10:03 eroell

Thanks for the example - so to my understanding you add some (maybe externally computed) statistics to the .obs field, and then move these to the .X field.

Quick clarification, what is in the awkward arrays (I guess not a PCA embedding as in my comment above), which would be in the .obm field that is still valid?

I think this diagram will help you understand what I want to do. Please kindly have a look at it. image

In adata, each row represents a patient's visit. In .obs, there are all the episode level features. And they have only a single value and without any time information, like height/age/gender. In .obsm, there are all time series features stored in awkward array. They have multiple records and have both time information and values. For example, adata.obsm['heart rate'][0] = awkward_array(time: [0, 1, 2, 4, 5, 7], value: [99, 80, 78, 91, 70, 90])

In this settings, I can easily move data from .obs to .X using move_to_x we are talking about and I can also move .obsm to .X using some aggregations. And if I change .X, .obsm will still be valid as it stores data irrelevant to .X

xinyuejohn avatar Mar 11 '24 09:03 xinyuejohn

Thanks for the diagram, I think I get the point here - it is data in the .obsm in your case, not a computed result. The .obsm field in the scRNA-seq setting is often used for computed results, e.g. most things that produce an embedding of some sort. In ehrapy, also many tools compute an embedding and store that in the .obsm field.

It probably makes sense to not mix things computed from .X with raw data together in .obsm? Although expert users like you @xinyuejohn ofc can use this structure if helpful ;)

We have yet to include time series support in a meaningful way - a data field alike your use of .obsm is one of the candidates for sure.

Whatever this might look like, copying all the -data- as you suggest here should happen indeed.

I think I would refrain from copying .obsm when doing move_to_x, as in many cases this field will be of results computed from .X

What do you think @xinyuejohn ?

eroell avatar Mar 18 '24 13:03 eroell

I also think that it's weird if it happens automatically, but maybe we can just parameterize this and well document the use-case?

Zethson avatar Mar 18 '24 14:03 Zethson

That would be .obsm, .uns, .obsp you'd consider taking in?

eroell avatar Mar 18 '24 14:03 eroell

Think so ya? Or just obms and uns?!?

Zethson avatar Mar 19 '24 19:03 Zethson

So what do we do with this PR?

Zethson avatar Apr 18 '24 14:04 Zethson

I'd be in favor of not having these values copied now, but storing this data later more consistently.

But I'm also fine having options for copying .obsm, .uns, .obsp (while setting the default to False)

eroell avatar Apr 19 '24 11:04 eroell

Okay, then let's make these options if it makes it easier for ehrdata. That can also depend on an evaluation of @eroell

Zethson avatar May 14 '24 15:05 Zethson

I'd close this PR for now, as I think the underlying outline of data handling patterns will not be the way to go

  • the .obsm field is likely not a suitable place to store data of time-series variables, as it lacks the .var field for many operations
  • moving .obsm to .X fails when .obsm is 3D arrays (.X must be 2D), which is what .obsm would often be if time series were to be stored here

eroell avatar Aug 30 '24 13:08 eroell