GEARS
Question on data pre-processing
Hi, I'm very interested in this work. Thanks for sharing.
I see that the code uses preprocessed h5ad files. I'd be interested to know how these files are produced from the source data, e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344
Hi @yhr91 @kexinhuang12345, I have the same question. How is the perturbation set generated, and where does it come from? Is it prior biological knowledge? I am training a large model and don't know how to obtain the input data. I hope I can get an answer. Thank you.
Thanks for your question, @cbirchsy. This notebook contains all the information needed to preprocess a scanpy AnnData object in general, and it is what we use for any new dataset:
https://github.com/snap-stanford/GEARS/blob/master/demo/data_tutorial.ipynb
As for the Norman dataset that you link to, it's from an older paper, so it needs some extra processing. I will try to share the specific preprocessing steps for that dataset soon.
@monoplasty What do you mean by perturbation set?
Thank you for your reply, @yhr91. I am working with the data from Replogle, J. M. et al., "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq". The data for this article comes as 10x h5ad files. The adata.obs dataframe only has a cell_type column. How do I get the condition column? My understanding is that the condition column is the perturbation set. How can I run GEARS with data like this?
(2) Create your own Perturb-Seq data. Prepare a scanpy adata object with the following:
adata.obs dataframe has condition and cell_type columns, where condition is the perturbation name for each cell. Control cells have condition format of ctrl, single perturbation has condition format of A+ctrl or ctrl+A, combination perturbation has condition format of A+B.
adata.var dataframe has gene_name column, where each gene name is the gene symbol.
adata.X stores the post-perturbed gene expression.
This is a case referenced in demo/data_tutorial.ipynb.
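The condition formats quoted above (ctrl, A+ctrl or ctrl+A, and A+B) can be checked with a small helper before handing data to GEARS. This is only a minimal sketch: classify_condition is a hypothetical name and is not part of the GEARS package, and rejecting anything outside the three quoted formats is an assumption of this sketch.

```python
def classify_condition(condition: str) -> str:
    """Classify a condition string as 'control', 'single', or 'combo'.

    Hypothetical helper (not part of GEARS). It only recognizes the
    formats quoted above: ctrl, A+ctrl / ctrl+A, and A+B.
    """
    parts = condition.split("+")
    if parts == ["ctrl"]:
        return "control"
    # Every non-control condition must have exactly two non-empty parts.
    if len(parts) != 2 or any(part == "" for part in parts):
        raise ValueError(f"Unrecognized condition format: {condition!r}")
    n_ctrl = parts.count("ctrl")
    if n_ctrl == 2:
        # ctrl+ctrl is not one of the quoted formats, so reject it here.
        raise ValueError(f"Unrecognized condition format: {condition!r}")
    return "single" if n_ctrl == 1 else "combo"


examples = ["ctrl", "NAF1+ctrl", "ctrl+BUB1", "NAF1+BUB1"]
labels = [classify_condition(c) for c in examples]
print(labels)  # ['control', 'single', 'single', 'combo']
```

Running a check like this over the unique values of the condition column is a quick way to spot entries that do not follow the expected naming.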
I think this condition field is the perturbation set, but I don't know how to get the data for this column.
In other words, how can I use GEARS to continue analyzing the results of cellranger?
Looking forward to your answer, thank you. @yhr91
Thanks for your question. The condition column just refers to the gene that is perturbed within each cell. It is usually provided in the .obs metadata within the AnnData object.
Are you using this link to download the Replogle 2022 (Cell) dataset?
If so, you will find the condition variable under the gene column in adata.obs for any of the anndata files that are provided on that webpage.
cell_barcode         gem_group  gene  gene_id          ...
AAACCCAAGAAATCCA-27  27         NAF1  ENSG00000145414  ...
AAACCCAAGAACTTCC-31  31         BUB1  ENSG00000169679  ...
Following this, the standard GEARS preprocessing is to just convert this column to the format <perturbed gene>+ctrl for single-gene perturbations.
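As a rough illustration of that conversion, here is a minimal sketch in plain Python. The function name to_gears_condition is hypothetical, and the control label "non-targeting" is an assumption about how control cells are named in the gene column; please check the actual values in your file before relying on it.

```python
def to_gears_condition(gene: str, control_label: str = "non-targeting") -> str:
    """Map one entry of the gene column to a GEARS condition string.

    Hypothetical helper. control_label is an assumed name for control
    cells and should be checked against the real data.
    """
    if gene == control_label:
        return "ctrl"
    # Single-gene perturbations use the <perturbed gene>+ctrl format.
    return f"{gene}+ctrl"


genes = ["NAF1", "BUB1", "non-targeting"]
conditions = [to_gears_condition(g) for g in genes]
print(conditions)  # ['NAF1+ctrl', 'BUB1+ctrl', 'ctrl']
```

With an AnnData object loaded, the same function could be applied to the whole column, for example adata.obs["condition"] = adata.obs["gene"].astype(str).map(to_gears_condition).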
Let me know if you have any other questions.
Hi, thank you for a great paper. In your supplementary material you said "This reduced the number of perturbations in the K562 cell line from 2058 to 1092" for the Replogle et al. essential dataset. Can you provide the data preprocessing notebook for that, or the list of perturbations?
Sorry for the very late response here. I will upload the data preprocessing script soon. In the meantime, you can directly load the Replogle dataset using the appropriate arguments: https://github.com/snap-stanford/GEARS/blob/df09d7ae34e90f5ef25afa389daf7c5c589e710d/gears/pertdata.py#L150-L152