abagen
abagen copied to clipboard
Strange sample numbers/`NaN` expression values returned when using text file-based masks/parcellation
I am trying to obtain sample-level expression data for each of 8 components (~parcels) from a data-driven cortical decomposition performed in CIVET space. I have both a LH/RH binary mask per component, with value 0/1 for each of 41k vertices, and a whole-brain atlas version of the components comprising a LH/RH file with a label between 1-8 per vertex.
Issue:
The samples returned for each component by get_samples_in_mask()
sum up to 8695 samples total, which exceeds the number of unique AHBA samples (~3500). Based on the well_id
values of the output dataframes, it appears that some samples are returned for >1 mask, even though the masks are non-overlapping.
Using instead the whole-brain components atlas with get_expression_data(atlas, region_agg=None)
, I get an expression dataframe with a more reasonable *1678 samples, but this dataframe unusually contains some NaN
values - so it seems this approach is not working exactly as expected either.
*EDIT: This previously incorrectly said 1664, which is actually the number of samples I get if I don't use neuromaps.images.relabel_gifti()
when loading in my LH/RH label giftis. 1678 samples is reproducible with the code below.
Steps to reproduce:
The masks/atlas were converted from .txt to .gii format using neuromaps.images.construct_shape_gii()
. The CIVET midsurfaces used to construct abagen
-style atlases were converted from .obj to .gii using neuromaps.images.obj_to_gifti()
.
Data for reproducing bug: abagen_sample_bug_data.zip
Code used to obtain expression data:
# Python 3.8.5
import neuromaps
import abagen # v0.1.3
import nibabel as nib # v3.2.1
import pandas as pd # v1.1.2
import numpy as np # v1.19.2
ncomps=8
surface = ('surfaces/PS_N266_lh_average_midsurface.surf.gii', 'surfaces/PS_N266_rh_average_midsurface.surf.gii')
'''
1. Getting sample-level expression data for each component using binary masks
'''
def get_mask_exp(c):
component = (f'binary_masks/lh_component{c}.label.gii', f'binary_masks/rh_component{c}.label.gii')
mask = abagen.check_atlas(component, geometry = surface, space = 'fslr')
samples, coords = abagen.get_samples_in_mask(mask)
return samples
# run get_mask_exp() for each component
samples_list=[]
for c in range(1,ncomps+1):
mysamples = get_mask_exp(c)
samples_list.append(mysamples)
mysamples.to_pickle(f'samples{c}.pkl')
samples_stacked = pd.concat(samples_list, axis=0)
samples_stacked.to_pickle('samples_stacked.pkl')
'''
2. Getting sample-level expression data for each component using a whole-brain atlas
'''
# relabel GIFTI images such that labels for atlas are consecutive across hemispheres (so, component 1 = labels 1 and 9, component 2 = labels 2 and 10, etc.)
components = neuromaps.images.relabel_gifti(('atlas/lh_components.label.gii', 'atlas/rh_components.label.gii'))
atlas = abagen.check_atlas(components, geometry = surface, space = 'fslr') # returns: AtlasTree[n_rois=16, n_vertex=77122]
allsamples = abagen.get_expression_data(atlas, region_agg=None)
# check for NaN values
nan_total = allsamples.isna().sum().sum() # 463
nan_whichgenes = allsamples.columns[allsamples.isna().any()].to_list() # ['MALAT1', 'RPS4Y2']
### extract + save samples for each component from atlas output ###
# samples_bycomp = []
# for c in range(1,ncomps+1): # for each component number
# samples_bycomp.append(allsamples.loc[[c,ncomps+c],]) # store dataframe containing rows of `allsamples` with LH/RH `label` corresponding to component
# samples_bycomp[c-1].reset_index(level='label', drop=True, inplace=True) # drop `label` index level
# samples_bycomp[c-1].to_pickle(f'atlas_samples{c}.pkl')
#
# atlas_samples_stacked = pd.concat(samples_bycomp, axis=0)
# atlas_samples_stacked.to_pickle('atlas_samples_stacked.pkl')
Any thoughts on how to troubleshoot this would be much appreciated!
I corrected a small typo I noticed in the issue. Please see "*EDIT" in the comment above. Sorry for any confusion!