Datetimes

Open ivirshup opened this issue 4 years ago • 11 comments

Basic datetime IO support.

Currently this converts everything to numpy datetime arrays at write time. I'm not preserving pandas array types, since pandas has multiple, seemingly overlapping ways of dealing with datetimes. This implementation also does not support time zones, but that would be easy to add.
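To illustrate the idea of normalizing to numpy datetimes at write time, here is a minimal sketch of one way such a conversion could work: the datetime array is reinterpreted as int64 nanoseconds since the epoch (a type HDF5 can store) and the unit is kept alongside it. The helper names `encode_datetimes` and `decode_datetimes` are hypothetical and are not taken from this PR; the actual on-disk encoding used here may differ. Only naive (time-zone-free) datetimes are handled.

```python
import numpy as np
import pandas as pd


def encode_datetimes(values):
    """Hypothetical helper: return (int64 array, unit) for naive datetimes."""
    # Normalize to nanosecond resolution, then reinterpret the bytes as int64.
    arr = np.asarray(values, dtype="datetime64[ns]")
    return arr.view("int64"), "ns"


def decode_datetimes(ints, unit="ns"):
    """Hypothetical helper: inverse of encode_datetimes."""
    return np.asarray(ints, dtype="int64").view(f"datetime64[{unit}]")


# A pandas column with a missing value; NaT survives the round trip because
# it is stored as the minimum int64 value.
series = pd.Series(pd.to_datetime(["1970-01-01", "1982-01-01", None]))
ints, unit = encode_datetimes(series.to_numpy())
restored = decode_datetimes(ints, unit)
```

Storing the unit next to the integers is what would let a reader rebuild the right `datetime64` dtype; a time-zone-aware variant would presumably need to store the zone name as well.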

It would be good to get someone working with time series data to try this out and see if it meets their needs.

(I thought this would solve #455, but I now see that issue was about datetime scalars, which this does not currently support.)

ivirshup avatar Jan 14 '22 15:01 ivirshup

Codecov Report

Merging #684 (7907db0) into master (a5727a5) will increase coverage by 0.04%. The diff coverage is 93.93%.

@@            Coverage Diff             @@
##           master     #684      +/-   ##
==========================================
+ Coverage   83.12%   83.16%   +0.04%     
==========================================
  Files          34       34              
  Lines        4396     4419      +23     
==========================================
+ Hits         3654     3675      +21     
- Misses        742      744       +2     
| Impacted Files | Coverage Δ |
|---|---|
| anndata/_io/specs/methods.py | 84.14% <91.30%> (+0.44%) :arrow_up: |
| anndata/_io/specs/registry.py | 91.48% <100.00%> (ø) |

codecov[bot] avatar Jan 14 '22 15:01 codecov[bot]

It would be good to get someone working with time series data to try this out and see if it meets their needs.

We do, technically. If we detect datetimes, we always just copy them to obs directly.

Zethson avatar Jan 14 '22 17:01 Zethson

An example (or you giving this branch a shot) would be great.

Do you have a way of saving these AnnData objects at the moment?

ivirshup avatar Jan 14 '22 19:01 ivirshup

@imipenem can you help here?

Zethson avatar Jan 14 '22 20:01 Zethson

Do you have a way of saving these AnnData objects at the moment?

In ehrapy, writing these AnnData objects to .h5ad files just worked out of the box. But this might be because we save columns with datetime values in obs only, not in uns or X (and, from what I've read, pandas treats these datetimes somewhat differently than numpy does). So we never have any numpy datetime values stored in the AnnData object, which (IMO) fits our needs for now. So this would not affect us currently, or am I missing something, @Zethson?

Imipenem avatar Jan 14 '22 21:01 Imipenem

Thought so as well. We didn't run into any issues.

Zethson avatar Jan 14 '22 21:01 Zethson

I'm a little confused here. If I put any sort of dates into obs, that AnnData will fail to write to h5ad in 0.7.8.

Can you make an example of this? For me:

Failing example
import anndata as ad, pandas as pd, numpy as np
from vega_datasets import data
print(ad.__version__)  # prints 0.7.8
cars = data.cars()

dt_array = cars["Year"]
np_dt_array = dt_array.to_numpy()

N = np_dt_array.shape[0]
adata = ad.AnnData(X=np.ones((N, N)), obs=pd.DataFrame({"dt": dt_array}))

adata.write_h5ad("test_dt.h5ad")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_array(f, key, value, dataset_kwargs)
    184         value = _to_hdf5_vlen_strings(value)
--> 185     f.create_dataset(key, data=value, **dataset_kwargs)
    186 

/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148 
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)

/usr/local/lib/python3.9/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
     90             dtype = numpy.dtype(dtype)
---> 91         tid = h5t.py_create(dtype, logical=1)
     92 

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<M8[ns]')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
    288     else:
--> 289         write_array(group, key, series.values, dataset_kwargs=dataset_kwargs)
    290 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
    262     for col_name, (_, series) in zip(col_names, df.items()):
--> 263         write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
    264 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/bd/43q20k0n6z15tdfzxvd22r7c0000gn/T/ipykernel_4792/2332825967.py in <module>
----> 1 adata.write_h5ad("test_dt.h5ad")

~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1910             filename = self.filename
   1911 
-> 1912         _write_h5ad(
   1913             Path(filename),
   1914             self,

~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    109         else:
    110             write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111         write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    112         write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    113         write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)

/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/functools.py in wrapper(*args, **kw)
    875                             '1 positional argument')
    876 
--> 877         return dispatch(args[0].__class__)(*args, **kw)
    878 
    879     funcname = getattr(func, '__name__', 'singledispatch function')

~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
    128     if key in f:
    129         del f[key]
--> 130     _write_method(type(value))(f, key, value, *args, **kwargs)
    131 
    132 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    210         except Exception as e:
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"
    214                 f"Above error raised while writing key {key!r} of {type(elem)}"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
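Since the traceback shows that h5py has no conversion path for `datetime64[ns]`, one possible workaround on 0.7.8 is to turn datetime columns into strings before writing and parse them again after reading. The sketch below shows only the pandas side of that round trip; the helper names `datetimes_to_strings` and `strings_to_datetimes` are hypothetical, the chosen format drops sub-second precision, and whether this fits a given workflow is an assumption rather than something tested in this thread.

```python
import pandas as pd

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format; loses sub-second precision


def datetimes_to_strings(df):
    """Hypothetical helper: copy df with datetime columns rendered as strings."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime(ISO_FORMAT)
    return out


def strings_to_datetimes(df, columns):
    """Hypothetical helper: parse the named string columns back to datetimes."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], format=ISO_FORMAT)
    return out


obs = pd.DataFrame(
    {"dt": pd.to_datetime(["1970-01-01", "1982-01-01"]), "n": [1, 2]}
)
obs_str = datetimes_to_strings(obs)  # safe to hand to a writer without datetime support
obs_back = strings_to_datetimes(obs_str, ["dt"])  # after reading the file back
```

The obvious cost is that the caller has to remember which columns were datetimes, which is the kind of bookkeeping native datetime IO would remove.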

ivirshup avatar Jan 17 '22 12:01 ivirshup

Sure:

import ehrapy.api as ep

adatas = ep.dt.mimic_3_demo(encoded=False, mudata=False)
print(adatas["INPUTEVENTS_CV"].obs)
adata = adatas["INPUTEVENTS_CV"]
# This may take 5-20 minutes
ep.pp.knn_impute(adata)
adata_encoded = ep.pp.encode(adata, autodetect=True)
ep.io.write("test.h5ad", adata_encoded)

I would not be surprised if we store things differently than you somewhere, but feel free to play around with it. I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals. They are not real datetimes. Feedback is always appreciated!

Zethson avatar Jan 17 '22 17:01 Zethson

I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals.

That seems to be the case.

adata_encoded.obs["charttime"].cat.categories.dtype
dtype('O')

Would it be useful if these were actual datetimes? Then you could do things like ask how far apart the times were.
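As a small illustration of what real datetimes would allow, the sketch below starts from a categorical column of timestamp strings (similar in shape to `charttime` above, but with made-up values) and converts it so that elapsed times can be computed. The sample timestamps are invented for the example and are not taken from the MIMIC data.

```python
import pandas as pd

# A categorical column whose categories are plain strings, as observed above.
# The values here are invented for illustration.
charttime = pd.Series(
    ["2130-01-01 08:00:00", "2130-01-01 09:30:00", "2130-01-01 08:00:00"],
    dtype="category",
)

# Parse the strings into real datetimes.
as_datetime = pd.to_datetime(charttime.astype(str))

# With real datetimes, "how far apart are these?" is a simple subtraction.
elapsed = as_datetime - as_datetime.min()
print(elapsed)
```

The result is a timedelta column, which can be compared, sorted, or converted to hours with `elapsed.dt.total_seconds() / 3600`.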

ivirshup avatar Jan 18 '22 11:01 ivirshup

I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals.

That seems to be the case.

adata_encoded.obs["charttime"].cat.categories.dtype
dtype('O')

Would it be useful if these were actual datetimes? Then you could do things like ask how far apart the times were.

Not surprised. Our primary motivation was the coloring of plots and things like that.

Yeah, your suggested use case is a good one. In general, though, I am trying to reduce ehrapy's dependency on real time as much as possible and to work more with pseudotime :)

Zethson avatar Jan 18 '22 11:01 Zethson

@ivirshup is this PR still an approach that you'd follow, or has that changed since pandas 2.0 was released? Datetime support would still be great for ehrapy, especially for things like comparing datetimes.

Zethson avatar Aug 21 '23 13:08 Zethson