fishbaseapi icon indicating copy to clipboard operation
fishbaseapi copied to clipboard

Search for term anywhere in database

Open cboettig opened this issue 10 years ago • 11 comments

@sckott I'm wondering if we can construct an endpoint to search for a term anywhere in the database.

Given how disorganized the FishBase SQL is, it can be pretty hard to know which table to find something in (e.g. min/max temp, https://github.com/ropensci/rfishbase/issues/47, that several people have requested recently -- it must be there somewhere since it's on the species summary pages).

Haven't found a great solution for doing this in SQL, but there's a few ideas:

  • http://stackoverflow.com/questions/639531/search-in-all-fields-from-every-table-of-a-mysql-database
  • http://stackoverflow.com/questions/562457/search-for-all-occurrences-of-a-string-in-a-mysql-database

The information_schema answer in the first link looks promising. Let me know if you get a chance to take a whack at this?

cboettig avatar Jun 26 '15 23:06 cboettig

Hmm. We could index the whole thing in elasticsearch, and search that way. May not be appropriate, just off the top

sckott avatar Jul 10 '15 14:07 sckott

Something like this, playing with my installation locally:

SELECT TABLE_NAME,COLUMN_NAME FROM information_schema.`columns` limit 300, 5;
+-------------+--------------------+
| TABLE_NAME  | COLUMN_NAME        |
+-------------+--------------------+
| TABLESPACES | TABLESPACE_TYPE    |
| TABLESPACES | LOGFILE_GROUP_NAME |
| TABLESPACES | EXTENT_SIZE        |
| TABLESPACES | AUTOEXTEND_SIZE    |
| TABLESPACES | MAXIMUM_SIZE       |
+-------------+--------------------+
5 rows in set (0.09 sec)

Could do this query to find table, then merge columns required to get the data needed?

I guess if we're thinking of re-orging the API based on a flat file of everything, then this is moot

sckott avatar Jul 10 '15 19:07 sckott

Nice, this looks promising. Let's build an endpoint around this for now, since it should be quick. (I think it will be a ways away before we can replace everything with a flat file, but guess we'll see)

On Fri, Jul 10, 2015 at 12:10 PM, Scott Chamberlain < [email protected]> wrote:

Something like this, playing with my installation locally:

SELECT TABLE_NAME,COLUMN_NAME FROM information_schema.columns limit 300, 5;

+-------------+--------------------+ | TABLE_NAME | COLUMN_NAME | +-------------+--------------------+ | TABLESPACES | TABLESPACE_TYPE | | TABLESPACES | LOGFILE_GROUP_NAME | | TABLESPACES | EXTENT_SIZE | | TABLESPACES | AUTOEXTEND_SIZE | | TABLESPACES | MAXIMUM_SIZE | +-------------+--------------------+ 5 rows in set (0.09 sec)

Could do this query to find table, then merge columns required to get the data needed?

I guess if we're thinking of re-orging the API based on a flat file of everything, then this is moot

— Reply to this email directly or view it on GitHub https://github.com/ropensci/fishbaseapi/issues/59#issuecomment-120496563 .

Carl Boettiger UC Santa Cruz http://carlboettiger.info/

cboettig avatar Jul 13 '15 17:07 cboettig

@cboettig working on this, not quite sure what the use case is right now

Are users going to query for fields X, Y, and Z, and return those in any table, if found? Does this endpoint look for fields, and if exist, then return those fields (regardless of table they're in)? if so, then main parameter I guess is fields to specify what fields to return.

But there definitely are duplicate field names in different tables. how to deal with that? I guess users could give namespaced field names, like species:name where the table is species and the field they want is name - then we just parse those namespaced strings apart, and use them

sckott avatar Jul 21 '15 01:07 sckott

@sckott Here's an example use case: discover which table(s) contain the data of min and max temp for a species range (e.g. see https://github.com/ropensci/rfishbase/issues/47).

I imagine the user would query something like "Min Temp", though not quite sure if one would need good fuzzy matching. The exact field is called TempMin and comes from the stocks table, which isn't particularly intuitive from the name "stocks" (as the issue 47 discussion highlights). So the basic return object for the function would be a list of table names and the corresponding field name (since users are unlikely to guess the exact format of the the field name, e.g. TempMin and not MinTemp etc).

Like you say, it's not clear how to handle duplicate field names -- I imagine one would just return a list of all tables containing a match. (that may be useful more generally; e.g. if the duplicate really does refer to the same thing, like a StockID, this may be a way to discover tables that can be joined using a given variable).

cboettig avatar Jul 21 '15 18:07 cboettig

@cboettig that makes sense I think. Already have a working route for getting all tables and fields. Will try to add support for a fields parameter passed in to filter by whatever user wants to find

sckott avatar Jul 21 '15 19:07 sckott

@cboettig Okay, added a route /listfields https://github.com/ropensci/fishbaseapi/blob/master/api_docs.md#listfields - we can rename to anything you want.

Restarted server and getting same problem as yesterday though, so it's not up yet

sckott avatar Jul 21 '15 19:07 sckott

when it's up try:

  • fishbase.ropensci.org/listfields
  • fishbase.ropensci.org/listfields?fields=Year
  • fishbase.ropensci.org/listfields?fields=Year,Genus
  • fishbase.ropensci.org/listfields?fields=Year,Genus&exact=true

sckott avatar Jul 21 '15 19:07 sckott

WRT searching anywhere in the DB, I imagine Solr is our best bet, or ES as long as we're sure it can be secured properly. Guess it's a matter of whether we want to put in the effort for this

sckott avatar Mar 11 '16 18:03 sckott

see also https://fishbaseapi.readme.io/v1.0/discuss/5ceeefef9dd565022c894dbf

sckott avatar May 31 '19 19:05 sckott

Added fuzzy search for the /taxa route only, for only the params genus and species

sckott avatar May 31 '19 23:05 sckott