anndataR
anndataR copied to clipboard
Hdf5r rownames rework
Hey @lazappi !
Since the lack of support for boolean attributes remains a blocking issue with rhdf5, I decided to give hdf5r a try.
The current PR seems to work a lot better, still has some issues when writing h5ad files with anndataR and then trying to read them out again in Python anndata. I need to do some experiments with writing the same h5ad file from anndataR and Python anndata and seeing where the differences lie. I'm starting to think that in our implementation of hdf5_write_compressed
, the dtype
and space
should not be guessed but instead be manually specified depending on which write_h5ad_*
function it was called from.
Luckily, our internal read_*
/write_*
functions stayed pretty much the same since all rhdf5::*
could mostly be substituted with the corresponding hdf5r::*
functions.
While making the changes, I was struggling with our decision to keep the obs_names
/ var_names
separate from obs
and var
, because they are stored inside the obs
and var
and when making changes to the obs
and var
the first thing we do is throw it away.
By allowing the rownames of the obs
and var
to be the obs_names
and var_names
, the code did get simplified a lot.
hdf5_write_compressed <- function(file, name, value, compression = c("none", "gzip", "lzf"), scalar = FALSE) {
compression <- match.arg(compression)
if (!is.null(dim(value))) {
dims <- dim(value)
} else {
dims <- length(value)
}
dtype <- hdf5r::guess_dtype(value, scalar = scalar, string_len = Inf)
space <- hdf5r::guess_space(value, dtype = dtype, chunked = FALSE)
# TODO: lzf compression is not supported in hdf5r
# TODO: allow the user to choose compression level
gzip_level <- if (compression == "none") 0 else 9
out <- file$create_dataset(
name = name,
dims = dims,
gzip_level = gzip_level,
robj = value,
chunk_dims = NULL,
space = space,
dtype = dtype
)
# todo: create attr?
out
}
There is currently an issue with the released version of hdf5r (hhoeflin/hdf5r#208) which was the cause of some of the strange errors in packages like MuDataSeurat. We already managed to fix the issue, but it still needs to be merged into the main branch and released.
I think there was a reason we decided obs_names
/var_names
needed to be stored separately. I can't remember all the details but I think it had something to do with the fact that what you can store in rownames
of a data.frame
is pretty limited and it is possible that there are things in a file written from Python that can't be stored that way. Not sure if we ran into actual issues or if it was just a theoretical problem.
I didn't write any of the code for handling compression so I'm not sure about that. I could maybe look if needed but otherwise I don't really have an opinion there.
For me, switching from {rhfd5} to {hdf5r} would be a pretty major change, particularly as it affects if we submit to CRAN/Bioconductor. I'm not entirely opposed but I would want to better understand the differences and pro/cons first. We could probably reach out to the maintainers to try and get the boolean attributes added to {rhdf5} if that's the motivation for switching.
I asked about this and turns out we actually need to write an ENUM
attribute not a special boolean thing. This still requires some changes to {rhdf5} but I think relatively minor.
In the process I realised how we currently write boolean values anywhere is wrong and we need to replace it with the ENUM
approach. That's my mistake because I didn't know enough about how HDF5 works. Should be relatively easy to fix once the changes in {rhdf5} are settled. Maybe {hdf5r} already does it this way though?