geofetch
Metadata standardization
At the moment, geofetch can download, filter, and save metadata for specific GEO accessions. However, metadata in GEO is stored inconsistently: some information is redundant, and the same information can appear in different places.
For example, sample genome information may be stored under three (or more) different dictionary keys:
- "Sample_description": ['assembly: hg19', ...]
- "Sample_characteristics_ch1": ['genome build: hg19', ...]
- "Sample_data_processing": ['Genome_build: hg19', ...]
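To make the problem concrete, here is a minimal sketch of how the three keys listed above could be collapsed into a single genome value. It is only an illustration, not part of geofetch: the names `GENOME_FIELDS`, `GENOME_TAGS`, and `extract_genome` are hypothetical, and it assumes each sample is a dict whose values are lists of "tag: value" strings (or a single such string).

```python
from typing import Optional

# Hypothetical names; nothing here is part of geofetch's API.
# Fields that may carry genome information, checked in this order.
GENOME_FIELDS = (
    "Sample_description",
    "Sample_characteristics_ch1",
    "Sample_data_processing",
)
# Tags treated as synonyms after lowercasing and replacing "_" with " ".
GENOME_TAGS = {"assembly", "genome build"}


def extract_genome(sample: dict) -> Optional[str]:
    """Return the first genome value found in the known fields, else None."""
    for field in GENOME_FIELDS:
        values = sample.get(field, [])
        if isinstance(values, str):
            # Assumption: a field may hold one string instead of a list.
            values = [values]
        for entry in values:
            tag, sep, value = entry.partition(":")
            if not sep:
                continue  # not a "tag: value" entry
            normalized = tag.strip().lower().replace("_", " ")
            if normalized in GENOME_TAGS and value.strip():
                return value.strip()
    return None


sample = {
    "Sample_characteristics_ch1": ["tissue: liver"],
    "Sample_data_processing": ["Genome_build: hg19"],
}
print(extract_genome(sample))
```

A lookup table like this only covers tags someone has already seen, so it would need to grow as new spellings turn up in GEO.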
To create a good, standardized PEP .csv metadata file, all of this information has to be carefully preprocessed. This would be especially useful for creating a new endpoint in pephub.
In my opinion, we should create a new class, or set of functions, separate from geofetch, that standardizes all GEO metadata.
@nsheff @nleroy917
Yes. I think you are right that this is outside the scope of geofetch, at the moment. The first goal needs to be just to get the data as it exists. The next goal could be to sanitize and unify it.
The first step can be completely automated and that's what geofetch should do.
This second step is a much larger project and will require a human to be involved. It could also be an application area for some techniques from natural language processing.
I think we should start thinking about this, but it is not going to be solved right at the beginning, so don't let it hold up finishing the first goal.