Time-course of HCoV-229E infection of A549 cells

Open pnrobinson opened this issue 4 years ago • 9 comments

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89159

pnrobinson avatar Mar 19 '20 19:03 pnrobinson

I'm happy to help with these data. Perhaps it is worth talking about how we want them pre-processed? Otherwise, I'm happy to use a standard script we have used in my lab.

callahantiff avatar Mar 22 '20 16:03 callahantiff

I think that we basically want lists of differentially expressed/spliced(?) genes following coronavirus infection.

pnrobinson avatar Mar 22 '20 17:03 pnrobinson

A long time ago we wrote scripts for downloading stuff from GEO and processing the microarray data (http://ontologizer.de/howto/). The scripts probably need to be modernized.

pnrobinson avatar Mar 22 '20 17:03 pnrobinson

Sounds good. I'll take a look at the ontologizer stuff and work on this later today and tomorrow.

Thanks @pnrobinson!

callahantiff avatar Mar 22 '20 17:03 callahantiff

Hey @callahantiff - I think if you have existing software to ingest these data, you could just make a PR. Put your script in its own sub-directory in transform, then do this:

  • add a block for the URL of any files you need to incoming.yaml

  • as described here, alter your script to emit a nodes.tsv with these columns: id name category iri publications

  • and an edges.tsv with these columns: subject edge_label object relation provided_by publications

  • then add a line in run.py to call your script in transform/[your subdirectory]
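
A minimal sketch of the output step described above, in case it helps: it writes a nodes.tsv and an edges.tsv with exactly the columns listed. The function names (`write_tsv`, `write_graph`) and the tab delimiter are assumptions made for illustration, not something taken from the repo, so adapt them to whatever the existing transform scripts do.

```python
import csv
from pathlib import Path

# Column headers copied from the steps above.
NODE_COLUMNS = ["id", "name", "category", "iri", "publications"]
EDGE_COLUMNS = ["subject", "edge_label", "object", "relation",
                "provided_by", "publications"]


def write_tsv(path, columns, rows):
    """Write dict rows to a tab-separated file with a header line.

    Hypothetical helper: columns missing from a row are written as empty
    strings; keys that are not in `columns` raise a ValueError.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


def write_graph(output_dir, nodes, edges):
    """Write nodes.tsv and edges.tsv into output_dir (hypothetical layout)."""
    output_dir = Path(output_dir)
    write_tsv(output_dir / "nodes.tsv", NODE_COLUMNS, nodes)
    write_tsv(output_dir / "edges.tsv", EDGE_COLUMNS, edges)
```

A transform script would build one dict per node and per edge from the processed GEO data and then call `write_graph` once at the end; which identifiers, categories, and edge labels to use is left open here.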

justaddcoffee avatar Mar 22 '20 17:03 justaddcoffee

Excellent, happy to follow these steps. Thanks @justaddcoffee!

callahantiff avatar Mar 22 '20 17:03 callahantiff

@justaddcoffee - two quick questions!

  1. Do you want subdirectories added that match the project boards each data source belongs to (i.e., transcriptomics for this particular dataset), to help organize the data sources? It might add some organization and, since multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.
  2. Do we have a suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.

callahantiff avatar Mar 23 '20 15:03 callahantiff

Do you want subdirectories added that match the project boards each data source belongs to (i.e., transcriptomics for this particular dataset), to help organize the data sources? It might add some organization and, since multiple GEO datasets are likely to be downloaded, might avoid duplicated work. Otherwise, I can just create a new directory and name it pmid28355270_hcov229e_a549_cells.

If you'd like to organize some datasets using subdirectories (transcriptomics, etc.), I think that's reasonable.

Do we have a suggested nomenclature for naming new data source directories? If not, I would recommend something like what I have done above: the PMID, followed by a snippet of the GEO entry title.

Sounds reasonable - some sources will not have PMIDs, but it seems helpful to prepend these ids when we can.

justaddcoffee avatar Mar 23 '20 15:03 justaddcoffee

Actually, and I should have thought of this already (d'oh! 😄), we can just use the GEO identifier assigned to the dataset instead of the PMID, since every GEO dataset will have that type of identifier.
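
To make the naming idea concrete, here is a rough sketch of how a directory name could be derived from the GEO identifier plus a snippet of the entry title. The helper name `source_dir_name` and the lowercase, underscore-separated style are assumptions that simply mirror the PMID-based example earlier in this thread; the sketch also assumes only GEO series accessions (GSE followed by digits) are used.

```python
import re


def source_dir_name(geo_accession, title_snippet):
    """Build a directory name such as gse89159_hcov229e_a549_cells.

    Hypothetical helper: accepts only GEO series accessions (GSE + digits),
    which is an assumption of this sketch.
    """
    accession = geo_accession.strip().lower()
    if not re.fullmatch(r"gse\d+", accession):
        raise ValueError(f"not a GEO series accession: {geo_accession!r}")
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", title_snippet.lower()).strip("_")
    return f"{accession}_{slug}" if slug else accession
```

For this dataset, `source_dir_name("GSE89159", "HCoV229E A549 cells")` gives `gse89159_hcov229e_a549_cells`.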

callahantiff avatar Mar 23 '20 16:03 callahantiff