kg-covid-19
Time-course of HCoV-229E infection of A549 cells
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89159
I'm happy to help with these data. Perhaps it is worth talking about how we want them pre-processed? Otherwise, I'm happy to use a standard script we have used in my lab.
I think that we basically want to have lists of differentially expressed/spliced(?) genes following Corona infection
A long time ago we wrote scripts for downloading stuff from GEO and processing the microarray data: http://ontologizer.de/howto/. The scripts probably need to be modernized.
Sounds good. I'll take a look at the ontologizer stuff and work on this later today and tomorrow.
Thanks @pnrobinson!
I'm happy to help with these data. Perhaps it is worth talking about how we want them pre-processed? Otherwise, I'm happy to use a standard script we have used in my lab.
hey @callahantiff - I think if you have existing software to ingest these data, you could just make a PR. Put your script in its own sub-directory in transform, then do this:
- add a block for the URL of any files you need to incoming.yaml
- as described here, alter your script to emit a nodes.tsv with these columns: id name category iri publications
- and an edges.tsv with these columns: subject edge_label object relation provided_by publications
- then add a line in run.py to call your script in transform/[your subdirectory]
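For illustration, the two output files described in the steps above could be written roughly like this. This is only a minimal sketch, not the project's actual transform code: the column names come from the list above, while `write_tsv`, `transform`, the output directory layout, and every record value (the `EXAMPLE:` ids, the edge label, the category) are hypothetical placeholders standing in for whatever the real GEO processing would produce.

```python
import csv
from pathlib import Path

# Column orders taken from the steps above.
NODE_COLUMNS = ["id", "name", "category", "iri", "publications"]
EDGE_COLUMNS = ["subject", "edge_label", "object", "relation",
                "provided_by", "publications"]


def write_tsv(path, columns, rows):
    """Write a list of dicts as a tab-separated file with a fixed header.

    Missing fields are written as empty strings; unexpected fields raise
    a ValueError so that column typos are caught early.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                                restval="", extrasaction="raise",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


def transform(output_dir):
    """Emit nodes.tsv and edges.tsv into output_dir (placeholder records)."""
    # Hypothetical records: real ones would come from the processed GEO data.
    nodes = [
        {"id": "EXAMPLE:gene1", "name": "example gene",
         "category": "example_category"},
        {"id": "EXAMPLE:virus1", "name": "example virus",
         "category": "example_category"},
    ]
    edges = [
        {"subject": "EXAMPLE:gene1", "edge_label": "example_edge_label",
         "object": "EXAMPLE:virus1", "relation": "EXAMPLE:relation",
         "provided_by": "GSE89159", "publications": "PMID:28355270"},
    ]
    output_dir = Path(output_dir)
    write_tsv(output_dir / "nodes.tsv", NODE_COLUMNS, nodes)
    write_tsv(output_dir / "edges.tsv", EDGE_COLUMNS, edges)
```

Calling `transform("some/output/dir")` would create both files with the header rows in the order listed above; how run.py actually invokes a transform, and where the output should go, is not shown here and would need to follow the repository's own conventions.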
Excellent, happy to follow these steps. Thanks @justaddcoffee!
@justaddcoffee - two quick questions!
- Do you want subdirectories added that match the project boards each of the data sources belong to (i.e. transcriptomics for this particular data), to help organize the data sources? It might add some organization and, in this situation where multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.
- Do we have suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.
Do you want subdirectories added that match the project boards each of the data sources belong to (i.e. transcriptomics for this particular data), to help organize the data sources? It might add some organization and, in this situation where multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.
If you'd like to organize some datasets using subdirectories (transcriptomics, etc), I think that's reasonable.
Do we have suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.
Sounds reasonable - some sources will not have PMIDs, but it seems helpful to prepend these ids when we can.
Actually, and I should have thought of this already (d'oh! 😄), we can just use the GEO identifier assigned to the dataset instead of the PMID, since every GEO dataset will have that type of identifier.
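To make the naming idea concrete, here is a small sketch of how such a directory name could be built from the GEO accession plus a snippet of the entry title. The helper name `source_dir`, the lowercasing, and the optional grouping subdirectory under transform are assumptions for illustration (the lowercase style just mirrors the pmid28355270_hcov229e_a549_cells example above); nothing here is an agreed project convention.

```python
import re
from pathlib import PurePosixPath


def _slug(text):
    """Lowercase text and collapse runs of non-alphanumerics to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def source_dir(accession, title_snippet, group=None):
    """Return a hypothetical transform sub-directory path for a GEO dataset.

    accession:     GEO identifier, e.g. "GSE89159"
    title_snippet: a few words taken from the GEO entry title
    group:         optional organizing subdirectory, e.g. "transcriptomics"
    """
    parts = [_slug(accession), _slug(title_snippet)]
    name = "_".join(part for part in parts if part)
    base = PurePosixPath("transform")
    if group:
        base = base / _slug(group)
    return base / name


print(source_dir("GSE89159", "hcov229e a549 cells", group="transcriptomics"))
```

With the dataset from this issue, that would give transform/transcriptomics/gse89159_hcov229e_a549_cells, or transform/gse89159_hcov229e_a549_cells if no grouping subdirectory is used.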