GEARS Question on data pre-processing

Hi, I'm very interested in this work, thanks for sharing.

I see that the code uses preprocessed h5ad files. I'd be interested to know how these files are produced from the source data, e.g. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344

Oct 05 '23 16:10 cbirchsy

Hi @yhr91 @kexinhuang12345, I have the same question. How is the perturbation set generated, and what is its source? Is it prior biological knowledge? I am training a large model and I don't know how to prepare the input data. I hope I can get an answer. Thank you!

Oct 20 '23 01:10 monoplasty

Thanks for your question. @cbirchsy This notebook contains all the information needed for preprocessing a scanpy AnnData object in general and this is what we use for any new dataset.

https://github.com/snap-stanford/GEARS/blob/master/demo/data_tutorial.ipynb

As for the Norman dataset that you link to, it's from an older paper so needs some extra processing. I will try to share the specific preprocessing steps for that dataset soon.

Oct 20 '23 02:10 yhr91

@monoplasty What do you mean by perturbation set?

Oct 20 '23 02:10 yhr91

> @monoplasty What do you mean by perturbation set

Thank you for your reply, @yhr91. I am working with the data from Replogle, J. M. et al., "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq". The data from this article come as 10x h5ad files, and the adata.obs dataframe only has a cell_type column. How do I get the condition column? My understanding is that the condition column is the perturbation set. How can I run GEARS with data like this?

Oct 20 '23 08:10 monoplasty

(2) Create your own Perturb-Seq data

Prepare a scanpy adata object with:

- adata.obs dataframe has condition and cell_type columns, where condition is the perturbation name for each cell. Control cells have condition format of ctrl, single perturbation has condition format of A+ctrl or ctrl+A, combination perturbation has condition format of A+B.
- adata.var dataframe has gene_name column, where each gene name is the gene symbol.
- adata.X stores the post-perturbed gene expression.

This is a case referenced in demo/data_tutorial.ipynb.
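To make the condition format quoted above concrete, here is a minimal sketch in plain Python that checks whether a string follows it and extracts the perturbed genes. The helper names `is_valid_condition` and `perturbed_genes` are hypothetical and are not part of GEARS; the rules encode only what the quoted description says (ctrl, A+ctrl or ctrl+A, and A+B).

```python
from typing import List


def perturbed_genes(condition: str) -> List[str]:
    """Return the gene names in a condition string, dropping the 'ctrl' token."""
    parts = [p.strip() for p in condition.split("+")]
    return [p for p in parts if p and p != "ctrl"]


def is_valid_condition(condition: str) -> bool:
    """Check a string against the quoted format: ctrl, A+ctrl, ctrl+A, or A+B."""
    if condition == "ctrl":
        return True
    parts = condition.split("+")
    # Anything other than plain 'ctrl' must have exactly two non-empty parts.
    if len(parts) != 2 or any(p == "" for p in parts):
        return False
    # At least one of the two parts must be a real gene name.
    return len(perturbed_genes(condition)) >= 1


# Hypothetical labels used only for illustration.
for cond in ["ctrl", "NAF1+ctrl", "ctrl+BUB1", "NAF1+BUB1", "NAF1"]:
    print(cond, is_valid_condition(cond), perturbed_genes(cond))
```

A bare gene name such as NAF1 is rejected here because the quoted description requires the +ctrl suffix for single perturbations.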

I think this condition field is the perturbation set. But I don't know how to get the data of this column.

In other words, how can I use GEARS to continue analyzing the results of cellranger?

Looking forward to your answer, thank you. @yhr91

Oct 24 '23 01:10 monoplasty

Thanks for your question. The condition column just refers to the gene that is perturbed within each cell. It is usually provided in the .obs metadata within the AnnData object.

Are you using this link to download the Replogle 2022 (Cell) dataset?

If so, you will find the condition variable under the gene column in adata.obs for any of the anndata files that are provided on that webpage.

           cell_barcode  gem_group  gene          gene_id  ...
    AAACCCAAGAAATCCA-27         27  NAF1  ENSG00000145414  ...
    AAACCCAAGAACTTCC-31         31  BUB1  ENSG00000169679  ...

Following this, the standard GEARS preprocessing is simply to convert this column to the format <perturbed gene>+ctrl for single-gene perturbations.
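As a rough illustration of that conversion, here is a minimal pandas sketch. It is not the exact script used for the paper. It assumes the perturbed gene is stored in a column named gene (as in the excerpt above) and that control cells carry the label "non-targeting" in that column; the control label is an assumption, so check the actual values in your file. The function name `add_condition_column` and the third row of the toy table are made up for the example.

```python
import pandas as pd

# Assumption: label used for control cells in the `gene` column.
# Inspect obs["gene"].unique() on the real data to confirm.
CONTROL_LABEL = "non-targeting"


def add_condition_column(obs: pd.DataFrame,
                         gene_col: str = "gene",
                         control_label: str = CONTROL_LABEL) -> pd.DataFrame:
    """Return a copy of `obs` with a GEARS-style `condition` column added."""
    obs = obs.copy()
    genes = obs[gene_col].astype(str)
    # Perturbed cells become '<gene>+ctrl'; control cells become plain 'ctrl'.
    obs["condition"] = (genes + "+ctrl").where(genes != control_label, "ctrl")
    return obs


# Toy stand-in for adata.obs; the last row is a made-up control cell.
obs = pd.DataFrame(
    {"gem_group": [27, 31, 5], "gene": ["NAF1", "BUB1", "non-targeting"]},
    index=["AAACCCAAGAAATCCA-27", "AAACCCAAGAACTTCC-31", "EXAMPLE-CONTROL-CELL-5"],
)
obs = add_condition_column(obs)
print(obs["condition"].tolist())  # ['NAF1+ctrl', 'BUB1+ctrl', 'ctrl']
```

With a real AnnData object this would be applied as adata.obs = add_condition_column(adata.obs). The cell_type column mentioned in the tutorial would still need to be set separately if it is missing.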

Let me know if you have any other questions.

Oct 29 '23 08:10 yhr91

Hi, thank you for a great paper. In your supplementary material you said "This reduced the number of perturbations in the K562 cell line from 2058 to 1092" for the Replogle et al. essential dataset. Can you provide the data preprocessing notebook for that, or the list of perturbations?

Dec 12 '23 19:12 GordianArnav

Sorry for the very late response here. I will upload the data preprocessing script soon. In the meantime, you can directly load the Replogle dataset using the appropriate arguments: https://github.com/snap-stanford/GEARS/blob/df09d7ae34e90f5ef25afa389daf7c5c589e710d/gears/pertdata.py#L150-L152

Feb 27 '24 07:02 yhr91
