post idea: Single UI for small and larger data
APIs are great, as we can all agree, but they aren't great when you want ALL THE DATA and the data is very big. In that case a simple compressed dump, or something equivalent, is sufficient.
But if it's a dump of flat files or a SQL DB, then users can't take advantage of the slick user interfaces we are trying to build, since those are normally built around web APIs.
So can we make a single interface to both web APIs and flat files or DBs? With taxize, we're trying to do this: where SQL dumps are available, make functions act exactly the same, but swap a web API in and out for a local SQL DB.
Very cool, sounds like this is in the spirit of what @hadley has done with dplyr functions, in that they can work in exactly the same way whether the data is in a data.frame in local memory or on a (potentially remote) database server.
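The backend-swapping idea can be sketched as a single function that dispatches on a backend argument. A minimal sketch in Python for illustration; the function name, schema, and data here are hypothetical, not taxize's actual implementation:

```python
import sqlite3

def classification(species, backend="api", db=None):
    """Return a taxonomic classification for a species name.

    The caller sees one signature; the data comes from a web API
    or from a local SQL dump, depending on `backend`.
    """
    if backend == "db":
        # Local SQL dump branch: fast, no rate limits, works offline.
        row = db.execute(
            "SELECT kingdom, family FROM taxa WHERE species = ?",
            (species,),
        ).fetchone()
        return {"kingdom": row[0], "family": row[1]} if row else None
    # A real client would issue an HTTP request here; stubbed out
    # so this sketch runs offline.
    raise NotImplementedError("api backend not shown in this sketch")

# Demo with an in-memory SQLite "dump" (toy data):
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE taxa (species TEXT, kingdom TEXT, family TEXT)")
db.execute("INSERT INTO taxa VALUES ('Gadus morhua', 'Animalia', 'Gadidae')")

print(classification("Gadus morhua", backend="db", db=db))
# → {'kingdom': 'Animalia', 'family': 'Gadidae'}
```

The point of the pattern is that user code never changes when the backend does; only the `backend` argument (or a package-level option) moves.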
Of course there is still a trade-off here between convenience and completeness; a user of the API has only to understand the API call and not the full data schema.
Nothing prevents returning all the data in a RESTful API call; it just isn't generally seen as that useful, since API design tends to assume the client is some webpage that wants to show a specific small slice of the data, not deal with the whole thing. Such an API might return a more concise format than XML or JSON (or rather, return some metadata in such a format along with a URL to download the compressed full data).
All this is to say that I think this is really a question of API design, and more of a continuum than a binary issue of API vs. DB. After all, `GET http://url/to/database/dump.sql` is still a RESTful call.
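One way such a continuum-style API could behave: answer small queries inline, and answer "all the data" with a pointer to a compressed dump. A sketch of client-side handling, where the endpoint, field names, and URL are all invented for illustration:

```python
import json

# Hypothetical metadata response from a taxa endpoint: the server
# declines to stream 1.2M records as JSON and instead points the
# client at a compressed dump.
response = json.loads("""
{
  "record_count": 1200000,
  "records": null,
  "dump_url": "https://example.org/taxa/dump.sql.gz",
  "dump_format": "sql+gzip"
}
""")

def resolve(response, inline_limit=10000):
    """Decide how a client should fetch the data this response describes."""
    if response["records"] is not None:
        return ("inline", response["records"])
    if response["record_count"] > inline_limit:
        # Too big to page through the API: fetch the dump instead.
        return ("download", response["dump_url"])
    return ("paginate", None)

print(resolve(response))
# → ('download', 'https://example.org/taxa/dump.sql.gz')
```

Under this design the API and the dump are the same interface; the response size just determines which path the client takes.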
@cboettig right, it's not really binary, good point.
Another related point I forgot to mention is speed. Users of taxize sometimes complain about it. The issue isn't simple, since it involves users' internet connection speeds, server resources, etc., but it is nonetheless a reason to switch to a dump where one is available. Speed is tied up with the size of the data and the number of requests. If users want to query data on thousands of species, and ITIS doesn't allow batch querying (i.e., `.../species?id=1,2,3,4`), and its servers are very slow, then this becomes pretty painful.
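The arithmetic behind batch querying is simple: one request per batch instead of one per species. A sketch against a hypothetical endpoint that accepts comma-separated ids (which, per the complaint above, ITIS does not):

```python
def batch_urls(base, ids, batch_size=100):
    """Collapse one-request-per-id into one request per `batch_size` ids,
    for services that accept comma-separated id lists."""
    return [
        f"{base}?id={','.join(str(i) for i in ids[k:k + batch_size])}"
        for k in range(0, len(ids), batch_size)
    ]

ids = list(range(1, 251))  # 250 species ids
urls = batch_urls("https://example.org/species", ids, batch_size=100)
print(len(urls))  # → 3 requests instead of 250
```

With per-request latency dominating, a batch endpoint cuts wall-clock time by roughly the batch size; without one, the client is stuck paying a round trip per species.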
We can suggest changes to APIs, but scientific data providers are often very underfunded, so we can only expect so much.
As for dumping larger amounts of data via APIs: GBIF, for example, has a download API (http://www.gbif.org/developer/occurrence#download), and I tried to build it into our rgbif client, but they asked us not to because it could bring their servers down.
Right, providers could offer `GET http://.../dump.sql`, but we'd still need to wrap that SQL interface for users who don't know SQL, which I'd guess is most of ours.
Yeah, speed is a good question. In some cases there may be something sensible we can cache in the package to facilitate this.
For instance, in rfishbase I cache the taxa list within the package, so that any of the functions can easily move between, say, a family/class name and a list of all species it contains, or can internally convert species names to SpecCodes for querying without needing another call. The table has a bit over 37,000 rows but compressed into R it's only a bit over 200 KB. It can be updated over the API in tens of seconds on a good connection, and is, I think, an obvious candidate for caching. However, other functions still have the limitation of looping over a long species list, one API call per species, so we still face the problem there too.
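The cached-lookup pattern can be sketched with a toy table; the species names below are real, but the three-row table and the codes are illustrative stand-ins for the ~37,000-row cached table, not rfishbase's actual data:

```python
# Toy stand-in for a taxa table shipped with the package.
TAXA = [
    {"species": "Gadus morhua",        "family": "Gadidae",    "spec_code": 1},
    {"species": "Gadus macrocephalus", "family": "Gadidae",    "spec_code": 2},
    {"species": "Salmo salar",         "family": "Salmonidae", "spec_code": 3},
]

# Build both lookups once at load time; thereafter, name→code and
# family→species conversions cost zero API calls.
SPEC_CODE = {t["species"]: t["spec_code"] for t in TAXA}
BY_FAMILY = {}
for t in TAXA:
    BY_FAMILY.setdefault(t["family"], []).append(t["species"])

print(SPEC_CODE["Salmo salar"])  # → 3
print(BY_FAMILY["Gadidae"])      # → ['Gadus morhua', 'Gadus macrocephalus']
```

The design trade-off is staleness versus speed: the cached table can drift from the live source, so a refresh-over-the-API function is the usual companion to a cache like this.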
I'm not familiar enough with taxize to know if there's anything that can be cached there to speed up the looping over species. Or maybe at some point it will make sense for us to serve our own endpoints to facilitate multi-species queries / larger data returns?
Maybe there are other workarounds too, e.g. if the species list shares a common higher-level taxon, it may be possible to make a single API query on the higher group and then subset, rather than looping over the species list?
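That subset-the-higher-group workaround can be sketched as follows, with canned data standing in for a single family-level API response (the helper and its data are invented for illustration):

```python
def fetch_family(family):
    """Stand-in for ONE API call returning every species in a family.
    A real client would hit the provider here; this returns canned data."""
    canned = {
        "Gadidae": [
            {"species": "Gadus morhua",        "status": "accepted"},
            {"species": "Gadus macrocephalus", "status": "accepted"},
            {"species": "Merlangius merlangus","status": "accepted"},
        ],
    }
    return canned[family]

wanted = {"Gadus morhua", "Merlangius merlangus"}

# One request for the whole family, then subset locally -- versus
# len(wanted) separate per-species requests.
records = [r for r in fetch_family("Gadidae") if r["species"] in wanted]
print([r["species"] for r in records])
# → ['Gadus morhua', 'Merlangius merlangus']
```

This only pays off when the higher group isn't too much larger than the species list itself; querying a whole class to get ten species would move the cost from request count to payload size.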
Anyway, if most users' only problem is speed, I'd take that as a major success.