traitdataform

include more trait datasets incl. Std version

Open fdschneider opened this issue 7 years ago • 15 comments

The package should provide more datasets from the living spreadsheet (https://github.com/fdschneider/bexis_traits/issues/20).

  • [x] identify data for integration
  • [x] write script to extract data upon call of data() (files are placed in 'data/' directory)
  • [x] include documentation in package files 'R/data.R'

A standardised version of each dataset should be provided as well (linking to trait Thesauri and taxon Ontologies).

fdschneider avatar Nov 15 '17 11:11 fdschneider

Not super sure about including more datasets in the package itself (I don't know if there is an "ideal" size for a package). If we do, they should be small, I guess. We can alternatively/also provide a tutorial with more examples on how to handle different trait datasets using the package (not only the CC0 ones).

caterinap avatar Nov 15 '17 12:11 caterinap

Ha! Trick is, we're not including the datasets, just providing code to pull the datasets from their source:

See files in data.R. Only when you call data(carabids) the file is downloaded and made available for use. The package remains small. The user decides what to download.
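
A minimal sketch of the idea in R, not the package's actual code: data() sources any .R script it finds in a package's 'data/' directory, so such a script can read the dataset from its original location instead of shipping a copy. The helper name fetch_dataset and the local temporary file standing in for the remote source are assumptions for illustration.

```r
# Hypothetical helper (not part of traitdataform): read a dataset from its
# source location only at the moment it is requested.
fetch_dataset <- function(source, ...) {
  utils::read.csv(source, stringsAsFactors = FALSE, ...)
}

# Stand-in for a remote file: a small CSV written to a temporary location.
# In a real data script, `source` would be the URL of the published dataset.
demo_source <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(species = c("sp_a", "sp_b"), body_length = c(1.2, 3.4)),
  demo_source,
  row.names = FALSE
)

# A script such as data/<name>.R would contain a single assignment like this;
# it is only evaluated when the user calls data(<name>).
demo_traits <- fetch_dataset(demo_source)
```

Because nothing is read until the assignment runs, the installed package only carries the few lines of the script, not the data.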

The package vignette contains plenty of advice on how to harmonize your own data, or data from other sources.

fdschneider avatar Nov 15 '17 15:11 fdschneider

Then it's all good!! Sorry, I need to dive a bit more into the package!

caterinap avatar Nov 17 '17 09:11 caterinap

@fdschneider I started to add more datasets in https://github.com/caterinap/traitdataform/tree/master/data. See if it's fine, I can continue adding more later in the week. Also added more entries in the spreadsheet and a new column indicating if the dataset is in the package.

caterinap avatar Nov 21 '17 15:11 caterinap

Hi @fdschneider, your initiative seems really cool! I hope to use it soon ;)

A lot of work has been done by the people who built the Eco Data Retriever (http://www.data-retriever.org/, GitHub repo); you can see the available datasets here.

I'm also thinking about the traits package by rOpenSci. Maybe you could write some wrappers around those existing tools?

Rekyt avatar Nov 21 '17 15:11 Rekyt

@Rekyt Thanks. Yes, I looked into those. We basically use the same idea as Retriever when pulling example datasets from the original sources on Figshare or wherever. The 'traits' package is great for tapping APIs of more extensive databases. There is also the package 'TR8'.

It would be cool to have wrappers for these data sources that add harmonization on top.

fdschneider avatar Nov 21 '17 15:11 fdschneider

Ok, now all CC0 datasets are in the package, in the same form as the "carabids" one. On Windows I did not get errors when building the package (only warnings).

Some remarks:

  • I have not yet modified three datasets (biotraits, plantsBROT and plantsD3) because they have a CC BY 4.0 license. For the moment they are still there and we can decide to remove or modify them later.
  • In heteroptera_raw I did not convert the coordinates to decimal because we do not import any package to do so (as far as I saw).
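
For reference, the conversion itself needs no extra package. The following is a minimal base-R sketch, assuming the coordinates are already split into numeric degrees, minutes and seconds plus a hemisphere letter; the actual format in heteroptera_raw is not stated in this thread, and dms_to_decimal is a hypothetical helper name.

```r
# Hypothetical helper: convert degrees/minutes/seconds to decimal degrees.
# Assumes the parts are already numeric and the hemisphere is given as
# "N", "S", "E" or "W".
dms_to_decimal <- function(degrees, minutes = 0, seconds = 0, hemisphere = "N") {
  decimal <- abs(degrees) + minutes / 60 + seconds / 3600
  # Southern and western coordinates are negative by convention.
  sign <- ifelse(toupper(hemisphere) %in% c("S", "W"), -1, 1)
  sign * decimal
}

north <- dms_to_decimal(51, 30, 0, "N")
mixed <- dms_to_decimal(c(10, 8), c(15, 45), c(0, 0), c("W", "E"))
```

Parsing the raw coordinate strings into these parts would still depend on how they are written in the source file.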

Have a look and let me know if you want to add/remove/change anything!

caterinap avatar Nov 28 '17 15:11 caterinap

Great, thanks.

I will pull and test it.

I wasn't aware that some of those datasets have so many traits. Great job mapping them to the ontologies. However, I just noticed that the URIs in Nadja's list are not correct. They should correspond to the URL with headings, e.g. https://ecologicaltraitdata.github.io/TraitDataList/#age_at_reproduction. We should fix this in the TraitDataList repository, @nadjasimons.

Furthermore, I thought that some of the cryptic trait names might be replaced by more intuitive trait names.
E.g. if the thesaurus call states

X10.2_SocialGrpSize = traitdataform::as.trait("social_group_size", expectedUnit = NA,
valueType = "numeric"),

then the function standardize() will keep the original name in traitName and provide the more intuitive one in traitNameStd.
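
The effect described here can be illustrated with plain base R. This is only a sketch of the outcome, not how standardize() is implemented; the lookup vector, the toy table and the traitValue column are assumptions for illustration.

```r
# Hypothetical lookup: original column name -> more intuitive trait name.
trait_lookup <- c(X10.2_SocialGrpSize = "social_group_size")

# Toy long-format trait table; traitName follows the thread's description,
# traitValue is an assumed column name.
traits <- data.frame(
  traitName  = c("X10.2_SocialGrpSize", "X10.2_SocialGrpSize"),
  traitValue = c(3, 12),
  stringsAsFactors = FALSE
)

# Keep the original name and add the harmonized one alongside it.
traits$traitNameStd <- unname(trait_lookup[traits$traitName])
```

Keeping both columns means the original labels stay traceable to the source publication while analyses can rely on the harmonized names.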

The CC BY 4.0 data could be added in the future in just the same way, since we always state the correct reference.

I think the Ricklefs data on passerine birds can't be included since it is not labelled as public domain or CC BY. Sorry, that license statement in the documentation is my fault, I guess. I already removed it from the current version.

fdschneider avatar Nov 28 '17 15:11 fdschneider

ok, so I will:

  • [x] change the URIs once they are fixed
  • [x] modify cryptic trait names
  • [ ] add the CC BY 4.0 datasets (when I have a bit of time)

Concerning the passerines, I actually checked before adding the dataset, and in the metadata (which is a Word file in the supplementary material) he states:

  1. Copyright restrictions: None
  2. Proprietary restrictions: None
  3. Costs: None

So I guess that we could keep it.

caterinap avatar Nov 28 '17 15:11 caterinap

Ok, thanks. No pressure. Whenever you find time.

The passerines: I'm relieved. After a colleague assured me that the data are open, I was desperately looking for this disclaimer but didn't find it. A great 'bad example' of open data labelling.

fdschneider avatar Nov 28 '17 16:11 fdschneider

I fixed URIs in the trait data list

nadjasimons avatar Nov 29 '17 17:11 nadjasimons

For now this is on hold because it overlaps with functionality provided by Will Pearse's natdb package (https://github.com/willpearse/natdb, @willpearse). They include 100+ datasets with short recipes (see this file: https://github.com/willpearse/natdb/blob/master/R/downloads.R) and, in the process, fix some major heterogeneity in the data (like replacing abbreviations with species names or adding units). I have not had the time to investigate how the data are processed into a virtual database. We should figure out how the two packages can complement each other.

Regardless, I would like to include Caterina's pull request for v1.0 to have some more example datasets to draw from.

fdschneider avatar Nov 26 '18 16:11 fdschneider

Sorry to have been a bit slow to reply to this mention.

We have a plan, right now, to get a citable bioRxiv paper for MADworld (which is going to combine NACDB and NATDB) up around late January or early February. We are definitely interested in interoperability, and I would love to make a wrapper linking your data structure into NATDB format. As I've mentioned before, but don't mind saying again, I think what you've done here is fantastic!!!

willpearse avatar Dec 04 '18 18:12 willpearse

Thanks Will, and sorry for not keeping up with our earlier e-mail discussion. I wanted to get a first functional version out before investigating interfaces with other tools any further. Let me know how I can help make this work seamlessly with your package.

fdschneider avatar Dec 07 '18 16:12 fdschneider

No worries; that's just life! :D

Makes sense to get something out that's functional first. When you have that ready, ping me and I will (1) take a look and then (2) figure out a path forward.

willpearse avatar Dec 07 '18 16:12 willpearse