Standardize data.frames across data sources

Open sckott opened this issue 10 years ago • 10 comments

Right now, we haven't thought about the outputs of each function. I believe all are data.frames. I'll look at each to see what it currently outputs, what is shared among them, and what standard format we could use that would also make combining outputs easier.

Sep 11 '15 17:09 sckott

cc @hlapp @xu-hong

Sep 11 '15 17:09 sckott

@sckott any progress or conclusions on this yet? @xu-hong is about at that step (leaving the issues with XML namespaces aside). Suggestions as to additional decorations that should be added to the data.frame object returned by RNeXML::get_characters()?

Sep 30 '15 20:09 hlapp

@hlapp sorry, hadn't looked at this yet. Will do so today

Sep 30 '15 20:09 sckott

Rundown of the data objects each function currently outputs:

| function | output |
| --- | --- |
| betydb_trait | data.frame |
| betydb_search | data.frame |
| betydb_citation | data.frame |
| betydb_site | data.frame |
| betydb_specie | data.frame |
| birdlife_habitat | data.frame |
| birdlife_threats | data.frame |
| coral_locations | data.frame |
| coral_methodologies | data.frame |
| coral_resources | data.frame |
| coral_species | data.frame |
| coral_taxa | data.frame |
| coral_traits | data.frame |
| eol_invasive_ | data.frame |
| fe_native | list |
| g_invasive | data.frame |
| is_native | data.frame |
| leda | data.frame |
| ncbi_byid | data.frame |
| ncbi_byname | data.frame/named list of data.frames |
| ncbi_searcher | data.frame/named list of data.frames |
| traitbank | list |

  • The betydb_*() functions that return lists could, I think, easily be made to return data.frames. I'll check. DONE
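
For the functions that can return a named list of data.frames, one option is to collapse the list into a single data.frame and keep the list names in an extra column. A minimal base R sketch follows; `bind_named()` and the `query` column name are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): collapse a named list of
# data.frames into one data.frame, storing each list name in `id_col`.
bind_named <- function(x, id_col = "query") {
  if (is.data.frame(x)) return(x)  # already a data.frame, nothing to do
  stopifnot(is.list(x), !is.null(names(x)))
  # Union of all column names, so data.frames with different columns can be stacked
  cols <- unique(unlist(lapply(x, names)))
  filled <- Map(function(df, nm) {
    # Add any column this data.frame lacks, filled with NA
    for (m in setdiff(cols, names(df))) df[[m]] <- rep(NA, nrow(df))
    df <- df[cols]
    df[[id_col]] <- rep(nm, nrow(df))
    df
  }, x, names(x))
  out <- do.call(rbind, filled)
  rownames(out) <- NULL
  out
}

# Made-up input shaped like a named list of data.frames
res <- list(
  a = data.frame(id = 1:2, v = c("p", "q")),
  b = data.frame(id = 3L, w = 2.5)
)
out <- bind_named(res)
```

Columns missing from one element are filled with NA, so the result always has the union of columns plus the name column.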

Sep 30 '15 21:09 sckott

Common fields among functions that could be standardized:

  • id - identifier for the record/taxon/etc. - the meaning of this id could vary between providers, however
  • date - date collected/updated, can be more than one of these
  • latitude - latitude, if the record is spatially explicit and the value is available
  • longitude - longitude, if the record is spatially explicit and the value is available
  • name - taxonomic name - combine any separate fields to make this one (not done yet), could be additional name columns

The above aren't real columns in the outputs yet; they're the ones I think could be standard across most of the data sources in this package.

Part of what makes this hard is that we have a diverse set of data sources, from morphological trait data, to nativity status, to molecular data.

I think a way forward could be to provide a suite of functions that transform the data to standardize column names, etc., so that outputs can be easily combined across taxa and data sources, at least across the standard fields. Other fields could be included as additional columns at the end.
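
A rough base R sketch of what such a transformation could look like. `standardize_df()` and `combine_std()` are hypothetical names, the input column names are made up, and the standard fields are the five listed above.

```r
# Standard fields proposed above
std_fields <- c("id", "date", "latitude", "longitude", "name")

# Hypothetical helper: rename provider-specific columns to the standard names,
# add missing standard columns as NA, and move non-standard columns to the end.
# `mapping` is a named character vector: names = standard field, values = source column.
standardize_df <- function(df, mapping) {
  stopifnot(is.data.frame(df),
            all(names(mapping) %in% std_fields),
            all(mapping %in% names(df)))
  names(df)[match(mapping, names(df))] <- names(mapping)
  stopifnot(!anyDuplicated(names(df)))  # renaming must not create duplicate columns
  for (f in setdiff(std_fields, names(df))) df[[f]] <- rep(NA, nrow(df))
  extras <- setdiff(names(df), std_fields)
  df[c(std_fields, extras)]
}

# Hypothetical helper: stack standardized outputs on the standard fields only
combine_std <- function(...) {
  do.call(rbind, lapply(list(...), function(d) d[std_fields]))
}

# Made-up outputs from two imaginary data sources
a <- data.frame(species_id = 1:2, lat = c(10.5, 20), lon = c(-5, 30),
                trait = c("x", "y"))
b <- data.frame(taxon = "Quercus robur", status = "native")

sa <- standardize_df(a, c(id = "species_id", latitude = "lat", longitude = "lon"))
sb <- standardize_df(b, c(name = "taxon"))
combined <- combine_std(sa, sb)
```

Each standardized data.frame keeps its source-specific columns (here `trait` and `status`) after the standard ones, and only the standard fields are used when combining.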

Sep 30 '15 21:09 sckott

> Common fields among functions that could be standardized:

You mean columns among data frames?

Sep 30 '15 21:09 hlapp

yes

Sep 30 '15 21:09 sckott

> Suggestions as to additional decorations that should be added to the data.frame

@hlapp I don't know what's available to add

Sep 30 '15 22:09 sckott

> name - taxonomic name - combine any separate fields to make this one (not done yet), could be additional name columns

@sckott do you mean the classification, or family and genus for taxa that are species?

Oct 07 '15 20:10 hlapp

@hlapp I mean name would ideally be genus + epithet + any subspecific epithets, OR the previous + authority.

I favor leaving authority off the name and having it in a separate column, if provided.

If the data record is identified only down to family, e.g., then I don't know what best practice is. Perhaps we'd leave name blank.
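
A small base R sketch of that rule. `build_name()` is a hypothetical helper, the column names and records are illustrative, and treating every record without both genus and epithet as blank (NA) is an assumption.

```r
# Hypothetical helper: name = genus + epithet + optional subspecific epithet.
# Authority is deliberately not part of the name. Records without both genus
# and epithet (e.g. identified only to family) get NA, which is an assumption.
build_name <- function(genus, epithet, infra = NA_character_) {
  stopifnot(length(genus) == length(epithet))
  infra <- rep_len(infra, length(genus))
  has_infra <- !is.na(infra) & nzchar(infra)
  nm <- ifelse(has_infra, paste(genus, epithet, infra), paste(genus, epithet))
  nm[is.na(genus) | is.na(epithet)] <- NA_character_
  nm
}

# Illustrative records; authority stays in its own column
recs <- data.frame(
  family    = c("Canidae", "Canidae", "Felidae"),
  genus     = c("Canis", "Canis", NA),
  epithet   = c("lupus", "lupus", NA),
  infra     = c(NA, "familiaris", NA),
  authority = c("Linnaeus, 1758", NA, NA)
)
recs$name <- build_name(recs$genus, recs$epithet, recs$infra)
```

The family-only record ends up with a blank name, while its family value is still available in its own column.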

Oct 07 '15 20:10 sckott