Datetimes
Basic datetime IO support.
Currently this converts everything to numpy datetime arrays at write time. I'm not preserving pandas array types, since there are multiple, seemingly overlapping ways to deal with datetimes in pandas. This implementation also does not support time zones, but that would be easy to add.
It would be good to get someone working with time series data to try this out and see if it meets their needs.
(I thought this would solve #455, but now see that was for datetime scalars, which this does not currently support.)
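As a rough illustration of the conversion described above (not the PR's actual code), a timezone-naive pandas datetime column converts cleanly to a NumPy datetime64 array, whereas a timezone-aware column carries a pandas-specific dtype and would first need its time zone dropped. The column values and the UTC workaround are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical timezone-naive datetime column, as it might appear in `obs`.
naive = pd.Series(pd.to_datetime(["2021-01-01 08:00", "2021-06-15 17:30"]))

# Converting to NumPy gives a plain datetime64 array; the pandas array type is lost.
as_numpy = naive.to_numpy()
print(type(as_numpy), as_numpy.dtype)

# A timezone-aware column uses pandas' DatetimeTZDtype, which has no NumPy equivalent.
aware = naive.dt.tz_localize("UTC")
print(aware.dtype)

# One possible workaround (an assumption, not part of this PR): normalise to UTC
# and drop the time zone before converting.
aware_as_numpy = aware.dt.tz_convert("UTC").dt.tz_localize(None).to_numpy()
print(aware_as_numpy.dtype)
```

Dropping the time zone loses information, so a real implementation would presumably need to record it alongside the values.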
Codecov Report
Merging #684 (7907db0) into master (a5727a5) will increase coverage by 0.04%. The diff coverage is 93.93%.
@@ Coverage Diff @@
## master #684 +/- ##
==========================================
+ Coverage 83.12% 83.16% +0.04%
==========================================
Files 34 34
Lines 4396 4419 +23
==========================================
+ Hits 3654 3675 +21
- Misses 742 744 +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| anndata/_io/specs/methods.py | 84.14% <91.30%> (+0.44%) | :arrow_up: |
| anndata/_io/specs/registry.py | 91.48% <100.00%> (ø) | |
> It would be good to get someone working with time series data to try this out and see if it meets their needs.
We do, technically. If we detect a datetime, we always just copy it to obs directly.
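For readers unfamiliar with what "detecting datetime" could look like, here is a minimal sketch using pandas' public dtype check. It is not ehrapy's actual logic, which is not shown in this thread; `datetime_columns` and the example frame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def datetime_columns(df: pd.DataFrame) -> list:
    """Return the names of columns holding datetime values (naive or tz-aware)."""
    return [name for name, col in df.items() if is_datetime64_any_dtype(col)]


# Hypothetical obs-like table with one datetime column.
obs = pd.DataFrame(
    {
        "age": [34, 61],
        "admitted": pd.to_datetime(["2021-01-01", "2021-06-15"]),
        "ward": ["A", "B"],
    }
)
print(datetime_columns(obs))  # ['admitted']
```

Note that this only flags columns with a real datetime dtype; dates stored as strings or categoricals are not detected.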
An example (or you giving this branch a shot) would be great.
Do you have a way of saving these AnnDatas at the moment?
@imipenem can you help here?
> Do you have a way of saving these AnnDatas at the moment?
At ehrapy, it just worked out of the box when writing these AnnDatas to .h5ad files. But this might be because we save columns with datetime values only in obs (and, from what I've read, pandas treats these datetimes somewhat differently than numpy does), never in uns or X. So we never have any np.datetime values stored in the AnnData object, which (IMO) fits our needs for now. So this would not affect us currently, or am I missing something @Zethson?
Thought so as well. We didn't run into any issues.
I'm a little confused here. If I put any sort of dates into obs, that AnnData fails to write to h5ad in 0.7.8.
Can you make an example of this? For me:
Failing example
import anndata as ad, pandas as pd, numpy as np
from vega_datasets import data
print(ad.__version__)  # 0.7.8
cars = data.cars()
dt_array = cars["Year"]  # pandas Series with dtype datetime64[ns]
np_dt_array = dt_array.to_numpy()
N = np_dt_array.shape[0]
adata = ad.AnnData(X=np.ones((N, N)), obs=pd.DataFrame({"dt": dt_array}))
adata.write_h5ad("test_dt.h5ad")  # raises the TypeError shown below
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
208 try:
--> 209 return func(elem, key, val, *args, **kwargs)
210 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_array(f, key, value, dataset_kwargs)
184 value = _to_hdf5_vlen_strings(value)
--> 185 f.create_dataset(key, data=value, **dataset_kwargs)
186
/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
148
--> 149 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
150 dset = dataset.Dataset(dsid)
/usr/local/lib/python3.9/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
90 dtype = numpy.dtype(dtype)
---> 91 tid = h5t.py_create(dtype, logical=1)
92
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
TypeError: No conversion path for dtype: dtype('<M8[ns]')
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
208 try:
--> 209 return func(elem, key, val, *args, **kwargs)
210 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
288 else:
--> 289 write_array(group, key, series.values, dataset_kwargs=dataset_kwargs)
290
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
211 parent = _get_parent(elem)
--> 212 raise type(e)(
213 f"{e}\n\n"
TypeError: No conversion path for dtype: dtype('<M8[ns]')
Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
208 try:
--> 209 return func(elem, key, val, *args, **kwargs)
210 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
262 for col_name, (_, series) in zip(col_names, df.items()):
--> 263 write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
264
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
211 parent = _get_parent(elem)
--> 212 raise type(e)(
213 f"{e}\n\n"
TypeError: No conversion path for dtype: dtype('<M8[ns]')
Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
/var/folders/bd/43q20k0n6z15tdfzxvd22r7c0000gn/T/ipykernel_4792/2332825967.py in <module>
----> 1 adata.write_h5ad("test_dt.h5ad")
~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
1910 filename = self.filename
1911
-> 1912 _write_h5ad(
1913 Path(filename),
1914 self,
~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
109 else:
110 write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111 write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
112 write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
113 write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)
/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/functools.py in wrapper(*args, **kw)
875 '1 positional argument')
876
--> 877 return dispatch(args[0].__class__)(*args, **kw)
878
879 funcname = getattr(func, '__name__', 'singledispatch function')
~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
128 if key in f:
129 del f[key]
--> 130 _write_method(type(value))(f, key, value, *args, **kwargs)
131
132
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
210 except Exception as e:
211 parent = _get_parent(elem)
--> 212 raise type(e)(
213 f"{e}\n\n"
214 f"Above error raised while writing key {key!r} of {type(elem)}"
TypeError: No conversion path for dtype: dtype('<M8[ns]')
Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
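The root error in the traceback is that h5py has no conversion path for NumPy's datetime64 dtype. One common way around this, sketched below as an assumption rather than as what this PR or anndata does, is to store the underlying 64-bit integers together with the dtype string and view them back as datetimes on read. `encode_datetimes` and `decode_datetimes` are hypothetical helper names.

```python
import numpy as np


def encode_datetimes(arr):
    """Split a datetime64 array into (int64 values, dtype string)."""
    arr = np.asarray(arr)
    if arr.dtype.kind != "M":
        raise TypeError(f"expected a datetime64 array, got {arr.dtype}")
    # Reinterpret the same 8-byte values as integers; no data is copied.
    return arr.view(np.int64), arr.dtype.str


def decode_datetimes(values, dtype_str):
    """Rebuild the datetime64 array from the stored integers and dtype string."""
    return np.asarray(values, dtype=np.int64).view(dtype_str)


original = np.array(["1970-01-01", "1982-01-01"], dtype="datetime64[ns]")
values, dtype_str = encode_datetimes(original)
restored = decode_datetimes(values, dtype_str)
print(values.dtype, dtype_str, restored)
```

The integer array and the dtype string are both types h5py can write (for example as a dataset plus an attribute), which is why this pattern sidesteps the error.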
Sure:
import ehrapy.api as ep
adatas = ep.dt.mimic_3_demo(encoded=False, mudata=False)
print(adatas["INPUTEVENTS_CV"].obs)
adata = adatas["INPUTEVENTS_CV"]
# This may take 5-20 minutes
ep.pp.knn_impute(adata)
adata_encoded = ep.pp.encode(adata, autodetect=True)
ep.io.write("test.h5ad", adata_encoded)
I would not be surprised if we store things differently than you somewhere, but feel free to play around with it. I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals. They are not real datetimes. Feedback is always appreciated!
> I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals.
That seems to be the case.
adata_encoded.obs["charttime"].cat.categories.dtype
dtype('O')
Would it be useful if these were actual datetimes? Then you could do things like ask how far apart the times were.
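To make "how far apart the times were" concrete, here is a small sketch that parses a string categorical into real datetimes and computes elapsed hours. The timestamps are invented for illustration and only mimic the shape of a `charttime` column.

```python
import pandas as pd

# Hypothetical chart times stored as a categorical of strings.
charttime = pd.Series(
    ["2101-10-20 19:00:00", "2101-10-20 21:00:00", "2101-10-21 21:00:00"],
    dtype="category",
)

# Parse the strings into real datetimes.
as_datetime = pd.to_datetime(charttime.astype(str))

# With a datetime dtype, differences become timedeltas.
hours_since_first = (as_datetime - as_datetime.min()).dt.total_seconds() / 3600
print(hours_since_first.tolist())  # [0.0, 2.0, 26.0]
```

None of this arithmetic is available while the column remains a categorical of strings.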
Not surprised. Our primary motivation was the coloring of plots and things like that.
Yeah, your suggested use-case is a good one. Although, in general I am trying to reduce the dependency on real time as much as possible with ehrapy and to work more with pseudotime :)
@ivirshup is this PR still the approach you'd follow, or has that changed since pandas 2.0 was released? Datetime support would still be great for ehrapy, especially for things like comparing datetimes.