
Add notebook for downloading McFarland 2020 Figure 1 data

ethanweinberger opened this issue 2 years ago · 5 comments

This PR adds a Jupyter notebook to download the data from McFarland et al., 2020 used to produce Figure 1 (i.e., response to idasanutlin and control DMSO for different cell lines). This PR also adds a utils.py file to the datasets folder containing reusable functions for downloading/preprocessing.
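A reusable download helper of the kind `utils.py` might provide could look like the sketch below (the actual function names and layout in the PR may differ; this is illustrative only). Caching on the local path means re-running a curation notebook doesn't re-fetch large files:

```python
import os
import urllib.request

def cached_download(url: str, out_dir: str = "data") -> str:
    """Download `url` into `out_dir`, skipping the fetch if the file
    already exists locally. Returns the local path.

    The filename is taken from the last segment of the URL.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```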

A couple of things that should probably be hashed out before this gets merged:

  1. What's the unit of abstraction that each data notebook should cover? For example, for this notebook I only included the data used to produce Fig. 1c in McFarland et al., 2020 as opposed to all of the data. This was in part because I already had code for this subset of the data ready to go, but also because it might get unwieldy to include all metadata values for all of the data even when they're not necessary (e.g. TP53 mutation status might not be relevant outside of the nutlin experiments).
  2. Similar to 1., how much of the data processing lifecycle should each notebook cover? In my PR I include downloading the raw data as part of the notebook, but I see some notebooks in the repo start off from an h5ad file.
  3. Is there a standard preprocessing/quality-control workflow for all of the datasets, or is the plan to do things more ad hoc per dataset? For now, the AnnData object in my notebook just contains raw counts.
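On point 3, one lightweight check is whether a matrix really does hold raw (nonnegative, integer) counts, since normalization or log-transformation almost always introduces fractional values. A minimal pure-Python sketch (hypothetical helper, not part of the PR):

```python
def looks_like_raw_counts(matrix) -> bool:
    """Heuristic check that a matrix contains only raw counts.

    Raw count matrices hold nonnegative integers; normalized or
    log-transformed data typically contains fractional values,
    which this flags.
    """
    for row in matrix:
        for value in row:
            if value < 0 or float(value) != int(value):
                return False
    return True
```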

ethanweinberger avatar Mar 25 '22 18:03 ethanweinberger

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Hey Ethan, great questions! I'll post the answers here for now, but ideally there would be documentation somewhere more visible than an obscure `template.ipynb` notebook.

  1. As of now, for each dataset we define an `[author_year].ipynb` and an `[author_year]_curation.ipynb` notebook. The intention is that `[author_year]_curation.ipynb` contains what you've currently pushed for Norman19 (accession link to .h5ad), and `[author_year].ipynb` contains all the preprocessing that happens to the AnnData object afterwards. By the end of `[author_year]_curation.ipynb`, you should have an AnnData object that contains all author-provided metadata labels, gene names, and a raw count matrix. The thought process here is that some users may want to do the preprocessing themselves, while others may want to download several datasets knowing they've all been preprocessed similarly (e.g. when training machine learning models).
  2. Hopefully answered in 1.: `[author_year]_curation.ipynb` notebooks should start with the exact command to download the file. The idea is that the notebook should contain everything a user needs to exactly reproduce the data as linked from the repository, starting from a publicly available source.
  3. There is currently a notebook called `template.ipynb` which calls code from the repo. Copying that notebook and adapting it to your dataset is the expected amount of standardization.
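The curation contract described in 1. (author-provided metadata, gene names, raw count matrix) can be expressed as a small validation step. A hedged sketch, using a plain dict standing in for an AnnData object so it stays dependency-free (field names are illustrative):

```python
def validate_curated(adata: dict) -> list:
    """Return a list of problems with a curated dataset; empty means it passes.

    `adata` is a dict standing in for an AnnData object:
    'X' (count matrix), 'obs' (per-cell metadata), 'var' (gene names).
    """
    problems = []
    if not adata.get("var"):
        problems.append("missing gene names")
    if not adata.get("obs"):
        problems.append("missing author-provided metadata")
    X = adata.get("X", [])
    # curation notebooks should end with raw counts: nonnegative integers
    if any(v < 0 or float(v) != int(v) for row in X for v in row):
        problems.append("X does not look like raw counts")
    return problems
```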

yugeji avatar Mar 31 '22 11:03 yugeji

Got it: the distinction between the curation and preprocessing notebooks makes sense to me.

Based on that distinction, it seems to make sense to have the `mcfarland_2020_curation` notebook grab all of the potentially useful data/metadata; people can then subset it later if they want. I'll update this PR sometime in the next few days.
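Subsetting the full curated object down to, say, the idasanutlin/DMSO cells used for Fig. 1c then becomes the user's one-liner. A sketch against a hypothetical `obs` table (the real metadata column names in the curated object may differ):

```python
def subset_by_treatment(obs_rows, treatments):
    """Keep only cells whose 'perturbation' field is in `treatments`.

    obs_rows: list of per-cell metadata dicts (a stand-in for adata.obs rows).
    """
    return [row for row in obs_rows if row.get("perturbation") in treatments]
```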

ethanweinberger avatar Mar 31 '22 18:03 ethanweinberger

Closing since this is taken care of by `mcfarland_2020_curation.ipynb`.

ethanweinberger avatar Mar 31 '22 18:03 ethanweinberger

Reopening per @yugeji's request

ethanweinberger avatar Apr 01 '22 17:04 ethanweinberger