geofetch Metadata standardization

At the moment, geofetch can download, filter, and save metadata for specific accessions in GEO. But metadata in GEO is stored in different, messy ways. Some of the information is redundant, and the same information can be stored in different places.

For example, sample genome information may be stored under three (or more) different dictionary keys:

"Sample_description": ["assembly: hg19", ...]
"Sample_characteristics_ch1": ["genome build: hg19", ...]
"Sample_data_processing": ["Genome_build: hg19", ...]

To create a good, standardized PEP .csv metadata file, all of this information has to be carefully preprocessed. This would be especially useful for creating a new endpoint in pephub.

In my opinion, we have to create a new class, or a set of functions, that is separate from geofetch and standardizes all GEO metadata.
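A minimal sketch of what one such standalone function could look like, using the genome example above. Everything here is an assumption for illustration: the name `extract_genome`, the list of keys to search, the label spellings, and the idea that each value is a list of "label: value" strings. None of it is part of geofetch.

```python
import re
from typing import Optional

# Hypothetical list of GEO keys that may carry genome information,
# searched in this order.
GENOME_SOURCE_KEYS = (
    "Sample_description",
    "Sample_characteristics_ch1",
    "Sample_data_processing",
)

# Label spellings treated as "genome". This is an assumption based on the
# three examples above, not an exhaustive list.
GENOME_LABEL = re.compile(
    r"^\s*(?:assembly|genome[ _]build)\s*:\s*(\S+)", re.IGNORECASE
)


def extract_genome(sample: dict) -> Optional[str]:
    """Return the first genome build found under any known key, or None."""
    for key in GENOME_SOURCE_KEYS:
        values = sample.get(key, [])
        if isinstance(values, str):
            # Tolerate a single string where a list was expected.
            values = [values]
        for entry in values:
            match = GENOME_LABEL.match(entry)
            if match:
                # Drop stray quotes around the value, if any.
                return match.group(1).strip("'\"")
    return None
```

A caller could then write the result into a single standardized column (for example `genome`) regardless of which GEO key it came from. Real GEO records are much messier than this, so a rule like this would only cover the simple cases.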

Feb 09 '22 03:02 khoroshevskyi

@nsheff @nleroy917

Feb 09 '22 03:02 khoroshevskyi

Yes. I think you are right that this is outside the scope of geofetch, at the moment. The first goal needs to be just to get the data as it exists. The next goal could be to sanitize and unify it.

The first step can be completely automated and that's what geofetch should do.

This second step is a much larger project and will require a human to be involved. It could also be an application area for some techniques from natural language processing.

I think we should start thinking about this, but it is not going to be solved right at the beginning, so don't let it hold up finishing the first goal.

Feb 09 '22 11:02 nsheff

> But metadata in GEO is stored in different, messy ways.

Seems like the motivation behind the NIH’s Big Data to Knowledge initiative. It sounds like standardizing messy metadata was the goal of things like DataMed and the DATS model.

Feb 09 '22 14:02 nleroy917
