pycisTopic
pycisTopic copied to clipboard
pycisTopic find_diff_features function stopping abruptly at Starting a local Ray instance
Hello!
I am having issues with the find_diff_features function in the pycisTopic library as a part of the SCENIC+ workflow. I have not had any issues generating the previous, required dataframes in the workflow according to the live lecture aside from perhaps a few NaNs in the Seurat_cell_type column of my cistopic_obj. But for some reason when I run find_diff_features, Jupyter Notebook tells me that a local Ray instance has started but then cuts off right there without any additional processes, and I am left with an empty markers_dict. Is there something I can check to fix this? Here is the code I am running.
import ray
markers_dict= find_diff_features( cistopic_obj, imputed_acc_obj, variable='Seurat_cell_type', var_features=variable_regions, contrasts=None, adjpval_thr=0.05, log2fc_thr=np.log2(1.5), n_cpu=5, _temp_dir='/tmp', split_pattern = None )
Hi @ptk1601
Can you check if you are running out of memory perhaps? @ghuls has also been updating this code. It should run much smoother with his new version.
Best,
Seppe
Hi! I actually managed to figure this one out, turns out there was a discrepancy between barcode names that was causing this issue!
Hi @ptk1601,
Here's an example on finding variable features with the updated code from @ghuls. This code is faster, uses less memory and handles the imputed accessibility differently. The code for identifying differentially accessible regions (DARs) between groups is not compatible yet with this new implementation, but will be updated soon.
For now, it means that for DAR calculation, the original impute_accessibility function is still required. This function should only use the calculated variable regions, which already saves quite some memory and time compared to using the full region set.
EXAMPLE CODE
Requirements: https://github.com/aertslab/pycisTopic/tree/polars_1xx
## load object and initialise variables
from pycisTopic.cistopic_class import *
from pycisTopic.diff_features import *
import pickle
infile = open('cistopic_object.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
topic_region_ = cistopic_obj.selected_model.topic_region.to_numpy(dtype=np.float32)
cell_topic_ = cistopic_obj.selected_model.cell_topic.to_numpy(dtype=np.float32)
regions = cistopic_obj.selected_model.topic_region.index.tolist()
## Get mean and dispersion of normalized imputed accessibility per region.
(region_names_to_keep,
per_region_means_on_normalized_imputed_acc,
per_region_dispersions_on_normalized_imputed_acc,) =
calculate_per_region_mean_and_dispersion_on_normalized_imputed_acc(
region_topic=topic_region_,
cell_topic=cell_topic_,
region_names=regions,
scale_factor1 = 10**6,
scale_factor2 = 10**4,
regions_chunk_size=20000,)
## Optional: save output
import numpy as np
import pandas as pd
np.savez_compressed("impute_acc_per_region_mean.npz", arr=impute_acc_per_region_mean)
np.savez_compressed("impute_acc_per_region_dispersion.npz", arr=impute_acc_per_region_dispersion)
np.savez_compressed(" region_idx_to_keep.npz", arr=region_idx_to_keep)
pd.DataFrame(region_names_to_keep, columns=["regions"]).to_csv("region_names_to_keep.csv", index=False)
region_topic shape (1292285, 80)
2024-11-06 13:38:12,600 cisTopic INFO Calculate total imputed accessibility per cell.
2024-11-06 13:38:12,600 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 0-20000 (out of 1292285).
2024-11-06 13:38:12,601 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:38:21,890 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:38:33,077 cisTopic INFO - Only keep integer part.
2024-11-06 13:38:44,036 cisTopic INFO - Get non-zero regions.
2024-11-06 13:38:44,242 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:39:19,566 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 20000-40000 (out of 1292285).
2024-11-06 13:39:19,567 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:39:28,928 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:39:40,103 cisTopic INFO - Only keep integer part.
2024-11-06 13:39:51,093 cisTopic INFO - Get non-zero regions.
2024-11-06 13:39:51,146 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:40:25,185 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 40000-60000 (out of 1292285).
2024-11-06 13:40:25,185 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:40:34,535 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:40:45,140 cisTopic INFO - Only keep integer part.
2024-11-06 13:40:57,225 cisTopic INFO - Get non-zero regions.
2024-11-06 13:40:58,244 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:41:32,441 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 60000-80000 (out of 1292285).
2024-11-06 13:41:32,441 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:41:42,069 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:41:53,982 cisTopic INFO - Only keep integer part.
2024-11-06 13:42:07,238 cisTopic INFO - Get non-zero regions.
2024-11-06 13:42:07,416 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:42:43,969 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 80000-100000 (out of 1292285).
2024-11-06 13:42:43,970 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:42:53,047 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:43:03,882 cisTopic INFO - Only keep integer part.
2024-11-06 13:43:17,189 cisTopic INFO - Get non-zero regions.
2024-11-06 13:43:17,828 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 13:43:56,578 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 100000-120000 (out of 1292285).
2024-11-06 13:43:56,579 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 13:44:05,636 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 13:44:16,481 cisTopic INFO - Only keep integer part.
2024-11-06 13:44:27,535 cisTopic INFO - Get non-zero regions.
2024-11-06 13:44:27,898 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
...
2024-11-06 14:49:23,279 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 1260000-1280000 (out of 1292285).
2024-11-06 14:49:23,280 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:49:31,821 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:49:42,047 cisTopic INFO - Only keep integer part.
2024-11-06 14:49:53,865 cisTopic INFO - Get non-zero regions.
2024-11-06 14:49:54,500 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 14:50:31,988 cisTopic INFO Calculate (partial) imputed accessibility per cell for regions 1280000-1300000 (out of 1292285).
2024-11-06 14:50:31,989 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:50:37,342 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:50:44,149 cisTopic INFO - Only keep integer part.
2024-11-06 14:50:52,495 cisTopic INFO - Get non-zero regions.
2024-11-06 14:50:53,017 cisTopic INFO - Calculate (partial) sum of imputed accessibility for the whole (partial) cell column.
2024-11-06 14:51:13,964 cisTopic INFO Keeping 1290880 of 1292285 (non_zero) regions.
region_idx_to_keep shape before removing non_zeros (1292285,)
region_idx_to_keep shape after removing non_zeros (1290880,)
region_topic shape after removing non_zeros (1290880, 80)
2024-11-06 14:51:14,135 cisTopic INFO Scale total imputed accessibility per cell by dividing by 10000.
2024-11-06 14:51:14,140 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility per region.
2024-11-06 14:51:14,140 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 0-20000 (out of 1290880).
2024-11-06 14:51:14,141 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:51:22,786 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:51:33,128 cisTopic INFO - Only keep integer part.
2024-11-06 14:51:44,702 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:52:00,313 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:52:17,304 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:52:26,246 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 20000-40000 (out of 1290880).
2024-11-06 14:52:26,246 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:52:34,903 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:52:45,269 cisTopic INFO - Only keep integer part.
2024-11-06 14:52:56,430 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:53:12,427 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:53:31,357 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:53:39,282 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 40000-60000 (out of 1290880).
2024-11-06 14:53:39,283 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:53:47,914 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:53:58,282 cisTopic INFO - Only keep integer part.
2024-11-06 14:54:09,453 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:54:25,709 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:54:44,669 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:54:52,616 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 60000-80000 (out of 1290880).
2024-11-06 14:54:52,618 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:55:01,233 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:55:12,710 cisTopic INFO - Only keep integer part.
2024-11-06 14:55:23,373 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:55:38,710 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:55:56,254 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:56:03,917 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 80000-100000 (out of 1290880).
2024-11-06 14:56:03,918 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:56:13,061 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:56:24,124 cisTopic INFO - Only keep integer part.
2024-11-06 14:56:38,653 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:56:56,205 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:57:12,180 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 14:57:20,304 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 100000-120000 (out of 1290880).
2024-11-06 14:57:20,307 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 14:57:29,095 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 14:57:40,723 cisTopic INFO - Only keep integer part.
2024-11-06 14:57:52,244 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 14:58:06,778 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 14:58:24,206 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
..
2024-11-06 16:13:07,266 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 1260000-1280000 (out of 1290880).
2024-11-06 16:13:07,268 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 16:13:16,119 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 16:13:26,693 cisTopic INFO - Only keep integer part.
2024-11-06 16:13:40,386 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 16:14:02,385 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 16:14:19,737 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 16:14:27,796 cisTopic INFO Calculate mean and dispersion of normalized imputed accessibility for regions 1280000-1300000 (out of 1290880).
2024-11-06 16:14:27,796 cisTopic INFO - Calculate imputed accessibility for the current chunk of regions.
2024-11-06 16:14:32,622 cisTopic INFO - Scale imputed accessibility matrix chunk (CPM normalization).
2024-11-06 16:14:38,472 cisTopic INFO - Only keep integer part.
2024-11-06 16:14:44,173 cisTopic INFO - Normalize imputed accessibility by dividing by the total imputed accessibility per cell and multiply by 10000.
2024-11-06 16:14:53,695 cisTopic INFO - Add pseudocount of 1 and apply log normalization.
2024-11-06 16:15:07,633 cisTopic INFO - Calculate mean and dispersion of imputed accessibility per region.
2024-11-06 16:15:11,838 cisTopic INFO Finished Calculating mean and dispersion of imputed accessibility per region.
CPU times: user 5h 13min 20s, sys: 1h 27min, total: 6h 40min 20s
Wall time: 2h 36min 59s
## Find highly variable features.
var_features = find_highly_variable_features(
features=region_names_to_keep,
per_region_means_on_normalized_imputed_acc=per_region_means_on_normalized_imputed_acc,
per_region_dispersions_on_normalized_imputed_acc=per_region_dispersions_on_normalized_imputed_acc,
min_disp = 0.05,
min_mean = 0.0125,
max_disp = float("inf"),
max_mean = 3,
n_bins = 20,
n_top_features = None,
plot = True,)
## Optional: save regions
pd.DataFrame(var_features, columns=["regions"]).to_csv("var_features.csv", index=False)
2024-11-06 16:29:23,756 cisTopic INFO Done!
Calculate DARs (old code still, WIP):
## read data
infile = open('cistopic_object.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
var_features_path = "var_features.csv"
var_features_df = pd.read_csv(var_features_path)
var_features = var_features_df['regions'].tolist()
## use old impute_accessibility function with variable features only
## note that this step may still require quite some resources
## this is only an intermediate solution and will be updated soon
imputed_acc_obj = impute_accessibility(
cistopic_obj,
selected_cells=None,
selected_regions=var_features,
scale_factor=10**6
)
## calculate DARs
markers_dict= find_diff_features(
cistopic_obj,
imputed_acc_obj,
variable='cell_type',
var_features=var_features,
contrasts = None,
adjpval_thr=0.05,
log2fc_thr=np.log2(1.5),
n_cpu=1,
_temp_dir='tmp_dir'
)
## Optional: save DARs
with open('DARs.pkl', 'wb') as f:
pickle.dump(markers_dict, f)
Hi, I am new to this, How do i install https://github.com/aertslab/pycisTopic/tree/polars_1xx?
I understand its a branch of pycistopic, but how do i access this branch when i have the latest main branch of scenic+ installed?
git clone https://github.com/aertslab/pycisTopic
cd pycisTopic
git checkout polars_1xx
pip install --no-deps .
find_diff_features is now also rewritten in the new pycistopic_v3 branch: https://github.com/aertslab/pycisTopic/commits/pycistopic_v3/, which makes is much faster (x15) and more memory efficient and does not use ray anymore:
Add get_marker_regions_for_contrast() as a much faster and memory efficient markers() replacement: https://github.com/aertslab/pycisTopic/commit/a820567a2e9ebea3400ddfc9d3f762d0705a9be3Rename find_diff_features to find_diff_accessible_regions and rewrite it completely to use the fast get_marker_regions_for_contrast: https://github.com/aertslab/pycisTopic/commit/064d28dcd65ef29cc1024ce20c7b22fa10fbfaaf
Hi,
I’m new to this package. I initially started with the main branch version but switched to the new polars_1xx to avoid running out of memory. However, I’m now having trouble with the binarize_topics function, which has changed. I can’t seem to obtain region_bin_topics_top_3k and region_bin_topics_otsu as shown in the tutorial.
I need these outputs to run the SCENIC+ Snakemake pipeline, so I need to save those region sets. Is there another way to obtain them with the current version?
I’d really appreciate any help in working around this issue.
Thanks, and congratulations on the amazing package!
Hi,
I have the same issue as @yolcasmar Do you have any fix?
Thank you ! Sébastien
Example of error I get for "otsu" (I have the same error for "3k" unless I set plot to False):
region_bin_topics_otsu = binarize_topics( cistopic_obj, method='otsu', plot=False, num_columns=5)
ValueError Traceback (most recent call last) Cell In[14], line 1 ----> 1 region_bin_topics_otsu = binarize_topics( 2 cistopic_obj, method='otsu', 3 plot=False, num_columns=5 4 )
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/pycisTopic/topic_binarization.py:108, in binarize_topics(cistopic_obj, target, method, smooth_topics, ntop, predefined_thr, nbins, plot, figsize, num_columns, save) 106 thr = predefined_thr["Topic" + str(i + 1)] 107 elif method == "otsu": --> 108 thr = threshold_otsu(l_norm, nbins=nbins) 109 elif method == "yen": 110 thr = threshold_yen(l_norm, nbins=nbins)
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/pycisTopic/topic_binarization.py:268, in threshold_otsu(array, nbins) 247 def threshold_otsu(array, nbins=100): 248 """ 249 Apply Otsu threshold on topic-region distributions [Otsu, 1979]. 250 (...) 266 cybernetics, 9(1), pp.62-66. 267 """ --> 268 hist, bin_centers = histogram(array, nbins) 269 hist = hist.astype(float) 270 # Class probabilities for all possible thresholds
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/pycisTopic/topic_binarization.py:336, in histogram(array, nbins) 320 """ 321 Draw histogram from distribution and identify centers. 322 (...) 333 Histogram values and bin centers. 334 """ 335 array = array.ravel().flatten() --> 336 hist, bin_edges = np.histogram(array, bins=nbins, range=None) 337 bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0 338 return hist, bin_centers
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/numpy/lib/histograms.py:780, in histogram(a, bins, range, density, weights) 680 r""" 681 Compute the histogram of a dataset. 682 (...) 776 777 """ 778 a, weights = _ravel_and_check_weights(a, weights) --> 780 bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights) 782 # Histogram is an integer or a float array depending on the weights. 783 if weights is None:
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/numpy/lib/histograms.py:426, in _get_bin_edges(a, bins, range, weights)
423 if n_equal_bins < 1:
424 raise ValueError('bins must be positive, when an integer')
--> 426 first_edge, last_edge = _get_outer_edges(a, range)
428 elif np.ndim(bins) == 1:
429 bin_edges = np.asarray(bins)
File ~/miniconda3/envs/pycistopic_alone/lib/python3.11/site-packages/numpy/lib/histograms.py:323, in _get_outer_edges(a, range) 321 first_edge, last_edge = a.min(), a.max() 322 if not (np.isfinite(first_edge) and np.isfinite(last_edge)): --> 323 raise ValueError( 324 "autodetected range of [{}, {}] is not finite".format(first_edge, last_edge)) 326 # expand empty range to avoid divide by zero 327 if first_edge == last_edge:
ValueError: autodetected range of [nan, nan] is not finite