anndataR Hdf5r rownames rework

Hdf5r rownames rework

Open rcannood opened this issue 4 months ago • 2 comments

Hey @lazappi !

Since the lack of support for boolean attributes remains a blocking issue with rhdf5, I decided to give hdf5r a try.

The current PR seems to work a lot better, still has some issues when writing h5ad files with anndataR and then trying to read them out again in Python anndata. I need to do some experiments with writing the same h5ad file from anndataR and Python anndata and seeing where the differences lie. I'm starting to think that in our implementation of hdf5_write_compressed, the dtype and space should not be guessed but instead be manually specified depending on which write_h5ad_* function it was called from.

Luckily, our internal read_*/write_* functions stayed pretty much the same since all rhdf5::* could mostly be substituted with the corresponding hdf5r::* functions.

While making the changes, I was struggling with our decision to keep the obs_names / var_names separate from obs and var, because they are stored inside the obs and var and when making changes to the obs and var the first thing we do is throw it away.

By allowing the rownames of the obs and var to be the obs_names and var_names, the code did get simplified a lot.

hdf5_write_compressed <- function(file, name, value, compression = c("none", "gzip", "lzf"), scalar = FALSE) {
  compression <- match.arg(compression)
  if (!is.null(dim(value))) {
    dims <- dim(value)
  } else {
    dims <- length(value)
  }
  dtype <- hdf5r::guess_dtype(value, scalar = scalar, string_len = Inf)
  space <- hdf5r::guess_space(value, dtype = dtype, chunked = FALSE)

  # TODO: lzf compression is not supported in hdf5r
  # TODO: allow the user to choose compression level
  gzip_level <- if (compression == "none") 0 else 9

  out <- file$create_dataset(
    name = name,
    dims = dims,
    gzip_level = gzip_level,
    robj = value,
    chunk_dims = NULL,
    space = space,
    dtype = dtype
  )
  # todo: create attr?

  out
}

There is currently an issue with the released version of hdf5r (hhoeflin/hdf5r#208) which was the cause of some of the strange errors in packages like MuDataSeurat. We already managed to fix the issue, but it still needs to be merged into the main branch and released.

Feb 10 '24 07:02 rcannood

I think there was a reason we decided obs_names/var_names needed to be stored separately. I can't remember all the details but I think it had something to do with the fact that what you can store in rownames of a data.frame is pretty limited and it is possible that there are things in a file written from Python that can't be stored that way. Not sure if we ran into actual issues or if it was just a theoretical problem.

I didn't write any of the code for handling compression so I'm not sure about that. I could maybe look if needed but otherwise I don't really have an opinion there.

For me, switching from {rhfd5} to {hdf5r} would be a pretty major change, particularly as it affects if we submit to CRAN/Bioconductor. I'm not entirely opposed but I would want to better understand the differences and pro/cons first. We could probably reach out to the maintainers to try and get the boolean attributes added to {rhdf5} if that's the motivation for switching.

Feb 12 '24 14:02 lazappi

I asked about this and turns out we actually need to write an ENUM attribute not a special boolean thing. This still requires some changes to {rhdf5} but I think relatively minor.

In the process I realised how we currently write boolean values anywhere is wrong and we need to replace it with the ENUM approach. That's my mistake because I didn't know enough about how HDF5 works. Should be relatively easy to fix once the changes in {rhdf5} are settled. Maybe {hdf5r} already does it this way though?

Feb 13 '24 10:02 lazappi

anndataR anndataR copied to clipboard

Hdf5r rownames rework

anndataR
anndataR copied to clipboard