pycisTopic icon indicating copy to clipboard operation
pycisTopic copied to clipboard

Bug report [BUG]: Unable to export region accessibility to loom TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Open jjia1 opened this issue 2 months ago • 4 comments

Describe the bug I'm running into the following error when attempting to export the differentially accessible regions to a loom file. TypeError: Object dtype dtype('O') has no native HDF5 equivalent

To Reproduce

  1. Generated imputed_acc_obj from the polars_1xx branch commands
  2. data is my pyCistopic object
  3. binarized_region_dict and binarized_cell_dict are from the following commands since their original output was changed in the polars_1xx branch.
# --- 1) REGIONS: polars -> main style dict[str -> DataFrame] ---
def binarized_regions_main_style(
    data,
    method: str = "otsu",
    smooth_topics: bool = True,
    nbins: int = 100,
):
    # Use the model’s topic-region matrix to ensure topic order and names match export
    topic_region: pd.DataFrame = data.selected_model.topic_region  # (regions x K)
    topic_labels = [str(c) for c in topic_region.columns]         # ["Topic1", ...]
    region_names = topic_region.index.astype(str).tolist()         # item names

    # polars_1xx binarizer expects items x topics
    prob_mat = topic_region.values  # NOTE: polars will do smoothing internally if smooth_topics=True

    names_per_topic, scores_per_topic, thresholds = binarize_topics(
        cell_or_region_topic_prob=prob_mat,
        cell_or_region_names=region_names,
        method=method,
        smooth_topics=smooth_topics,
        nbins=nbins,
    )

    # Build DataFrames with index = region names, single column = topic label, sorted desc by score
    reg_dict = {
        lab: pd.DataFrame(
                {lab: pd.Series(scores_per_topic[i], index=names_per_topic[i])}
            ).sort_values(lab, ascending=False)
        for i, lab in enumerate(topic_labels)
    }
    return reg_dict, thresholds

# --- 2) CELLS: polars -> main style dict[str -> DataFrame] ---
def binarized_cells_main_style(
    data,
    method: str = "li",
    smooth_topics: bool = True,
    nbins: int = 100,
):
    # The model stores cell_topic as (topics x cells); we need cells x topics
    cell_topic: pd.DataFrame = data.selected_model.cell_topic
    topic_labels = [str(r) for r in cell_topic.index]           # ["Topic1", ...]
    cell_names = cell_topic.columns.astype(str).tolist()
    cell_prob = cell_topic.T.values  # (n_cells x K)

    names_per_topic, scores_per_topic, thresholds = binarize_topics(
        cell_or_region_topic_prob=cell_prob,
        cell_or_region_names=cell_names,
        method=method,
        smooth_topics=smooth_topics,
        nbins=nbins,
    )

    cell_dict = {
        lab: pd.DataFrame(
                {lab: pd.Series(scores_per_topic[i], index=names_per_topic[i])}
            ).sort_values(lab, ascending=False)
        for i, lab in enumerate(topic_labels)
    }
    return cell_dict, thresholds

# --- 3) SANITIZE: ensure region names match imputed features; drop empty topics to avoid IndexError ---
def sanitize_for_export(
    binarized_region_dict: dict[str, pd.DataFrame],
    binarized_cell_dict: dict[str, pd.DataFrame],
    data,
    imputed_acc_obj,
):
    feature_names = pd.Index(map(str, imputed_acc_obj.feature_names))
    regulon_mat: pd.DataFrame = data.selected_model.topic_region
    regulon_mat.index   = regulon_mat.index.astype(str)
    regulon_mat.columns = [str(c) for c in regulon_mat.columns]

    # Intersect per-topic region lists with both universes
    cleaned_regions = {}
    dropped_topics = []
    for t, df in binarized_region_dict.items():
        # Keep rows present in both the imputed matrix and the model’s region index
        idx = pd.Index(df.index.astype(str))
        idx = idx.intersection(feature_names).intersection(regulon_mat.index)
        if len(idx) == 0:
            dropped_topics.append(t)
            continue
        # Re-sort by the **model’s** (unsmoothed) topic score so the last row is the “cutoff” row
        s = regulon_mat.loc[idx, t].sort_values(ascending=False)
        cleaned_regions[t] = pd.DataFrame({t: s})

    # Drop the same topics on the cell side, then filter cells to the ones in the export
    for t in dropped_topics:
        binarized_cell_dict.pop(t, None)

    # Optional: filter cells to the cells that will be in the loom (selected_cells) later;
    # exporter will do an intersection too, so this is mostly for symmetry.

    # Final key check
    common = set(cleaned_regions.keys()) & set(binarized_cell_dict.keys())
    cleaned_regions = {k: cleaned_regions[k] for k in sorted(common)}
    cleaned_cells   = {k: binarized_cell_dict[k] for k in sorted(common)}

    # Debug print (helps when a topic silently empties out)
    print("Topics kept:", len(common), "dropped:", dropped_topics)

    # Basic asserts to catch shape mistakes early
    assert len(cleaned_regions) > 0, "All topics empty after name harmonization."
    for t in cleaned_regions:
        assert isinstance(cleaned_regions[t], pd.DataFrame)
        assert cleaned_regions[t].shape[1] == 1
        assert cleaned_regions[t].columns[0] == t
        assert cleaned_regions[t].index.isin(feature_names).any()

    return cleaned_regions, cleaned_cells

# 1) Build main-style binarized dicts from polars binarizer
binarized_region_dict, thr_reg = binarized_regions_main_style(
    data, method="otsu", smooth_topics=True, nbins=100
)
binarized_cell_dict, thr_cell = binarized_cells_main_style(
    data, method="li", smooth_topics=True, nbins=100
)

# 2) Sanitize to match imputed features & model index; drop empty topics
binarized_region_dict, binarized_cell_dict = sanitize_for_export(
    binarized_region_dict, binarized_cell_dict, data, imputed_acc_obj
)

This is the command I ran.

export_region_accessibility_to_loom(
    accessibility_matrix = imputed_acc_obj,
    cistopic_obj = data,
    binarized_topic_region = binarized_region_dict,
    binarized_cell_topic = binarized_cell_dict,
    selected_cells = data.projections['cell']['UMAP'].index.tolist(),
    out_fname = os.path.join(out_dir, "loom", "pycisTopic_region_accessibility20250905.loom"),
    cluster_annotation = ['cell_type_clean'],
    # cluster_markers = cluster_markers,
    tree_structure = ('atac_mouse_brain', 'pycisTopic', 'noDBL_all'),
    title = 'Region accessibility all',
    nomenclature = "mm39",
    split_pattern = '-'
)

Error output TypeError: Object dtype dtype('O') has no native HDF5 equivalent

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <timed eval>:1

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/pycisTopic/loom.py:675, in export_region_accessibility_to_loom(accessibility_matrix, cistopic_obj, binarized_topic_region, binarized_cell_topic, out_fname, selected_regions, selected_cells, cluster_annotation, cluster_markers, tree_structure, title, nomenclature, split_pattern, **kwargs)
    673 # Create minimal loom
    674 log.info("Creating minimal loom")
--> 675 export_minimal_loom_region(
    676     ex_mtx=ex_mtx,
    677     cell_names=cell_names,
    678     feature_names=feature_names,
    679     out_fname=out_fname,
    680     regulons=regulon_mat,
    681     cell_annotations=None,
    682     tree_structure=tree_structure,
    683     title=title,
    684     nomenclature=nomenclature,
    685     embeddings=embeddings,
    686     auc_mtx=cell_topic,
    687     auc_thresholds=topic_thresholds,
    688 )
    690 # Add annotations
    691 log.info("Adding annotations")

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/pycisTopic/loom.py:890, in export_minimal_loom_region(ex_mtx, cell_names, feature_names, out_fname, regulons, cell_annotations, tree_structure, title, nomenclature, num_workers, embeddings, auc_mtx, auc_thresholds, compress)
    887     general_attrs["MetaData"] = compress_encode(value=general_attrs["MetaData"])
    889 # Create loom file for use with the SCope tool.
--> 890 lp.create(
    891     filename=out_fname,
    892     layers=ex_mtx,
    893     row_attrs=row_attrs,
    894     col_attrs=column_attrs,
    895     file_attrs=general_attrs,
    896 )

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/loompy/loompy.py:1076, in create(filename, layers, row_attrs, col_attrs, file_attrs)
   1073 			ds.ra[key] = vals
   1075 		for key, vals in col_attrs.items():
-> 1076 			ds.ca[key] = vals
   1078 except ValueError as ve:
   1079 	if os.path.exists(filename):

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/loompy/attribute_manager.py:129, in AttributeManager.__setitem__(self, name, val)
    125 def __setitem__(self, name: str, val: np.ndarray) -> None:
    126 	"""
    127 	Set the value of a named attribute
    128 	"""
--> 129 	return self.__setattr__(name, val)

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/loompy/attribute_manager.py:156, in AttributeManager.__setattr__(self, name, val)
    153 if self.ds._file[a].__contains__(name):
    154 	del self.ds._file[a + name]
--> 156 self.ds._file.create_dataset(
    157 	a + name,
    158 	data=values,
    159 	dtype=h5py.special_dtype(vlen=str) if values.dtype == np.object_ else values.dtype,
    160 	maxshape=(values.shape[0], ) if len(values.shape) == 1 else (values.shape[0], None),
    161 	fletcher32=False,
    162 	compression="gzip",
    163 	shuffle=False,
    164 	compression_opts=2
    165 )
    166 self.ds._file[a + name].attrs["last_modified"] = timestamp()
    167 self.ds._file[a].attrs["last_modified"] = timestamp()

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/h5py/_hl/group.py:186, in Group.create_dataset(self, name, shape, dtype, data, **kwds)
    183         parent_path, name = name.rsplit(b'/', 1)
    184         group = self.require_group(parent_path)
--> 186 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    187 dset = dataset.Dataset(dsid)
    188 return dset

File ~/.conda/envs/scenicplus_working/lib/python3.11/site-packages/h5py/_hl/dataset.py:88, in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, dapl, efile_prefix, virtual_prefix, allow_unknown_filter, rdcc_nslots, rdcc_nbytes, rdcc_w0, fill_time)
     86     else:
     87         dtype = numpy.dtype(dtype)
---> 88     tid = h5t.py_create(dtype, logical=1)
     90 # Legacy
     91 if any((compression, shuffle, fletcher32, maxshape, scaleoffset)) and chunks is False:

File h5py/h5t.pyx:1680, in h5py.h5t.py_create()

File h5py/h5t.pyx:1704, in h5py.h5t.py_create()

File h5py/h5t.pyx:1731, in h5py.h5t.py_create()

File h5py/h5t.pyx:1637, in h5py.h5t._c_compound()

File h5py/h5t.pyx:1704, in h5py.h5t.py_create()

File h5py/h5t.pyx:1771, in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

Version (please complete the following information):

  • Python 3.11.8
  • Polars 1xx branch

Additional Information I did some additional checks of data types in both data.cell_data and data.region_data with the following lines:

import pandas as pd
import numpy as np

# Define which columns should be what type
categorical_columns = [
    'sample_id', 'sample', 'batch', 'barcode', 'og_bc',
    'cell_type', 'cell_type_clean',
    'pycisTopic_leiden_10_0.6', 'pycisTopic_leiden_10_1.2', 'pycisTopic_leiden_10_3',
    'treatment', 'replicate', 'treatment_time', 'treatment_time_aggcontrol', 'ct_trt',
    'pycisTopic_harmony_celltype_lvl3', 'pycisTopic_bbknn_celltype_lvl3'
]

float_columns = [
    'cisTopic_nr_acc', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_log_nr_acc',
    'Doublet_scores_fragments', 'Predicted_doublets_fragments'
]

# Convert to appropriate types
for col in data.cell_data.columns:
    if col in categorical_columns:
        # Convert to categorical
        print(f"Converting {col} to categorical...")
        data.cell_data[col] = pd.Categorical(data.cell_data[col])
    elif col in float_columns:
        # Convert to float32
        print(f"Converting {col} to float32...")
        data.cell_data[col] = pd.to_numeric(data.cell_data[col], errors='coerce').astype('float32')

# Verify the conversions
print("\nVerified data types:")
for col in data.cell_data.columns:
    print(f"{col}: {data.cell_data[col].dtype}")

# Check for any other object columns in region_data
print("\nChecking all region_data columns:")
for col in data.region_data.columns:
    print(f"  {col}: {data.region_data[col].dtype}")

which gave the following outputs:

Converting sample_id to categorical...
Converting cisTopic_nr_acc to float32...
Converting cisTopic_nr_frag to float32...
Converting cisTopic_log_nr_frag to float32...
Converting og_bc to categorical...
Converting cisTopic_log_nr_acc to float32...
Converting sample to categorical...
Converting batch to categorical...
Converting barcode to categorical...
Converting cell_type to categorical...
Converting Doublet_scores_fragments to float32...
Converting Predicted_doublets_fragments to float32...
Converting pycisTopic_leiden_10_0.6 to categorical...
Converting pycisTopic_leiden_10_1.2 to categorical...
Converting pycisTopic_leiden_10_3 to categorical...
Converting treatment to categorical...
Converting replicate to categorical...
Converting treatment_time to categorical...
Converting treatment_time_aggcontrol to categorical...
Converting ct_trt to categorical...
Converting cell_type_clean to categorical...

Verified data types:
duplication_ratio: float64
total_fragments_in_peaks_count: int32
sample_id: category
cisTopic_nr_acc: float32
cisTopic_nr_frag: float32
pdf_values_for_fraction_of_fragments_in_peaks: float64
log10_unique_fragments_in_peaks_count: float64
duplication_count: int64
cisTopic_log_nr_frag: float32
og_bc: category
unique_fragments_count: uint32
pdf_values_for_duplication_ratio: float64
fraction_of_fragments_in_peaks: float64
cisTopic_log_nr_acc: float32
barcode_rank: uint32
total_fragments_count: int32
log10_total_fragments_count: float64
pdf_values_for_tss_enrichment: float64
log10_unique_fragments_count: float64
unique_fragments_in_peaks_count: uint32
tss_enrichment: float64
log10_total_fragments_in_peaks_count: float64
nucleosome_signal: float64
sample: category
batch: category
barcode: category
n_genes: float64
cell_type: category
Doublet_scores_fragments: float32
Predicted_doublets_fragments: float32
pycisTopic_leiden_10_0.6: category
pycisTopic_leiden_10_1.2: category
pycisTopic_leiden_10_3: category
timepoint: Int64
treatment: category
replicate: category
treatment_time: category
treatment_time_aggcontrol: category
ct_trt: category
cell_type_clean: category


Checking all region_data columns:
  Chromosome: object
  Start: int32
  End: int32
  Width: int32
  cisTopic_nr_frag: int64
  cisTopic_log_nr_frag: float64
  cisTopic_nr_acc: int64
  cisTopic_log_nr_acc: float64

I've tried fixing the Chromosome region_data column into a categorical string column, but I'm still running into the h5py error. If possible, would like some help getting around this issue. I'm able to export the gene activity to loom but keep getting this error for differentially accessible regions. Thanks!

jjia1 avatar Sep 12 '25 15:09 jjia1

It might choke on:

timepoint: Int64

https://pandas.pydata.org/docs/reference/api/pandas.arrays.IntegerArray.html#pandas.arrays.IntegerArray https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na

I am not a Pandas user, but as far as I know IntX (with capital I) is a new Pandas datatype that allows Nulls (None) as values in integer arrays. This is different from the old intX datatype which only allows integer values. IntX might be seen by h5py as object dtype.

ghuls avatar Sep 15 '25 12:09 ghuls

@ghuls I see your point. I'll check for any null values or just coercing into regular int64 rather than Int64. On the other hand, I did try saving a subset of the object and that worked. I also broke the object into batches and saved each batch separately successful, making me think it could be related to some memory issue, which is also strange because I'm not seeing my RAM being maxed out during the export process.

jjia1 avatar Sep 15 '25 18:09 jjia1

Checking for None values and converted the Int64 to int64 but ran into the same error. I'm assuming it's something happening on the backend for loompy.

jjia1 avatar Sep 15 '25 18:09 jjia1

Probably downgrading Pandas to 1.5.3 (<2.0) and numpy to 1.26.4 (<2.0.0) might help.

ghuls avatar Sep 26 '25 11:09 ghuls