kg-covid-19 Time-course of HCoV-229E infection of A549 cells

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89159

Mar 19 '20 19:03 pnrobinson

I'm happy to help with these data. Perhaps it is worth talking about how we want them pre-processed? Otherwise, I'm happy to use a standard script we have used in my lab.

Mar 22 '20 16:03 callahantiff

I think that we basically want to have lists of differentially expressed/spliced(?) genes following coronavirus infection.

Mar 22 '20 17:03 pnrobinson

A long time ago we wrote scripts for downloading data from GEO and processing the microarray data: http://ontologizer.de/howto/ The scripts probably need to be modernized.

Mar 22 '20 17:03 pnrobinson

Sounds good. I'll take a look at the ontologizer stuff and work on this later today and tomorrow.

Thanks @pnrobinson!

Mar 22 '20 17:03 callahantiff

I'm happy to help with these data. Perhaps it is worth talking about how we want them pre-processed? Otherwise, I'm happy to use a standard script we have used in my lab.

hey @callahantiff - I think if you have existing software to ingest these data, you could just make a PR. Put your script in its own sub-directory in transform, then do this:

1. Add a block for the URL of any files you need to incoming.yaml.
2. As described here, alter your script to emit a nodes.tsv with these columns: id name category iri publications
3. Also emit an edges.tsv with these columns: subject edge_label object relation provided_by publications
4. Then add a line in run.py to call your script in transform/[your subdirectory].
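
For illustration only, the two output files described in the steps above might be written along these lines. This is a minimal sketch, not the project's actual transform code: the `write_tsv` and `transform` names, the output directory argument, and every example row value are hypothetical placeholders. Only the column names come from this thread.

```python
import csv
from pathlib import Path

# Column orders as listed in this thread.
NODE_COLUMNS = ["id", "name", "category", "iri", "publications"]
EDGE_COLUMNS = ["subject", "edge_label", "object", "relation",
                "provided_by", "publications"]


def write_tsv(path, columns, rows):
    """Write a list of dict rows to a tab-separated file with a fixed header."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        # restval="" leaves a column empty when a row has no value for it.
        writer = csv.DictWriter(handle, fieldnames=columns,
                                delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)


def transform(output_dir):
    """Hypothetical entry point: emit nodes.tsv and edges.tsv into output_dir."""
    # Placeholder rows; a real script would derive these from the GEO data.
    nodes = [
        {"id": "EXAMPLE:1", "name": "example node 1", "category": "example"},
        {"id": "EXAMPLE:2", "name": "example node 2", "category": "example"},
    ]
    edges = [
        {"subject": "EXAMPLE:1", "edge_label": "example_relation",
         "object": "EXAMPLE:2", "provided_by": "example source"},
    ]
    write_tsv(Path(output_dir) / "nodes.tsv", NODE_COLUMNS, nodes)
    write_tsv(Path(output_dir) / "edges.tsv", EDGE_COLUMNS, edges)


if __name__ == "__main__":
    transform("example_output")
```

Keeping the header order in two constants means the script fails loudly (a `ValueError` from `csv.DictWriter`) if a row carries a key outside the expected columns.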

Mar 22 '20 17:03 justaddcoffee

Excellent, happy to follow these steps. Thanks @justaddcoffee!

Mar 22 '20 17:03 callahantiff

@justaddcoffee - two quick questions!

1. Do you want subdirectories added that match the project board each data source belongs to (i.e., transcriptomics for this particular dataset), to help organize the data sources? It might add some organization and, since multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.
2. Do we have a suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.

Mar 23 '20 15:03 callahantiff

Do you want subdirectories added that match the project board each data source belongs to (i.e., transcriptomics for this particular dataset), to help organize the data sources? It might add some organization and, since multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.

If you'd like to organize some datasets using subdirectories (transcriptomics, etc.), I think that's reasonable.

Do we have a suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.

Sounds reasonable - some sources will not have PMIDs, but it seems helpful to prepend these ids when we can

Mar 23 '20 15:03 justaddcoffee

Actually, and I should have thought of this already (d'oh! 😄), we can just use the GEO identifier assigned to the dataset instead of the PMID, since every GEO dataset will have that type of identifier.
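
As a rough sketch of that naming idea, a directory name could be built from the GEO series accession plus a short slug of the entry title. The `geo_dir_name` helper and its `max_words` cutoff are hypothetical and not an agreed project convention. The slug here is simply the first few words of the title, whereas a hand-picked snippet may read better.

```python
import re


def geo_dir_name(accession, title, max_words=4):
    """Build a directory name from a GEO series accession and a title slug."""
    # GEO series accessions have the form "GSE" followed by digits.
    if not re.fullmatch(r"GSE\d+", accession):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    # Keep only lowercase alphanumeric runs from the title.
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join([accession.lower()] + words[:max_words])


print(geo_dir_name("GSE89159",
                   "Time-course of HCoV-229E infection of A549 cells"))
# prints: gse89159_time_course_of_hcov
```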

Mar 23 '20 16:03 callahantiff
