
Add ability to recursively filter output from get_* functions

Open sckott opened this issue 11 years ago • 6 comments

E.g.,

get_tsn("Poa")

A lot of results are returned, so the user filters with a regex:

# some output printed, prompt given
# user types:
ann
# which filters to strings having "ann"

or by row number(s)

# some output printed, prompt given
# user types:
1:5
# which filters to rows 1 to 5

And this could go on recursively until the user exits or ends up with only one result, thus giving back the id itself.
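A single filter step of the kind described above might look like the sketch below. `filter_step()` is a hypothetical helper, not an existing taxize function, and the `candidates` data frame uses illustrative names with made-up ids rather than real ITIS output. It assumes the input is either a row number, a row range such as `1:5`, or a regex.

```r
# Hypothetical helper (not part of taxize): apply one filter step to a
# data frame of candidate matches.
filter_step <- function(df, input, column = "target") {
  if (grepl("^[0-9]+(:[0-9]+)?$", input)) {
    # Row selection: a single number ("3") or a range ("1:5")
    bounds <- as.integer(strsplit(input, ":", fixed = TRUE)[[1]])
    rows <- seq(bounds[1], bounds[length(bounds)])
    rows <- rows[rows >= 1 & rows <= nrow(df)]  # drop out-of-range rows
    out <- df[rows, , drop = FALSE]
  } else {
    # Anything else is treated as a regex against the name column
    out <- df[grepl(input, df[[column]]), , drop = FALSE]
  }
  rownames(out) <- NULL  # renumber so the next prompt starts at 1
  out
}

# Illustrative data only: the ids are made up, not real TSNs
candidates <- data.frame(
  tsn = c("1", "2", "3", "4"),
  target = c("Poa annua", "Poa annua var. aquatica",
             "Poa pratensis", "Poa trivialis"),
  stringsAsFactors = FALSE
)

step1 <- filter_step(candidates, "ann")  # regex: keeps the two "annua" rows
step2 <- filter_step(step1, "1")         # row number: keeps the first of those
```

Calling the helper repeatedly on its own output is what gives the "recursive" narrowing; the caller would stop once `nrow()` reaches 1.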

thoughts @EDiLD @zachary-foster

sckott, Dec 01 '14 20:12

@sckott, I just noticed you asked for thoughts on this. I tried running get_tsn("Poa") and get_tsn('Poa', ask=TRUE, rows = NA), but just got back a single result. Did something change in the last month? I also tried get_tsn('Satyrium'), another ambiguous taxon name, and only got back a single result.

zachary-foster, Jan 15 '15 21:01

Oh yeah, I forgot to share thoughts. I think it's a good idea if it does not take too much work to implement. Is it common for there to be that many homonyms for a taxon name? Or perhaps get_tsn("Poa") used to return the taxon ids for all of the species in that genus rather than the genus itself?

zachary-foster, Jan 15 '15 21:01

@zachary-foster yes, there have been some changes

There are two changes. First, get_tsn() now returns accepted names by default; see the accepted parameter.

For the case of Poa annua using ITIS data, the API call http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=poa%20annua results in just one name that is accepted, while all others are not accepted, so only one is returned.

Second, we now check for a direct match using grep(). If the regex returns only one match, we just return that one; if there is more than one match, we return all of them and the user is given the prompt, etc.

Does that make sense?
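As a rough illustration of that direct-match check (a sketch only, not taxize's actual code): `direct_match()` is a hypothetical helper, the anchored `^...$` pattern is an assumption since the exact pattern is not shown in this thread, and the data frame holds illustrative names with made-up ids. Falling back to the unfiltered candidates when nothing matches is also an assumption.

```r
# Hypothetical helper illustrating the described logic
direct_match <- function(df, name, column = "target") {
  # Assumption: "direct match" means the whole string equals the query
  hits <- grep(paste0("^", name, "$"), df[[column]], ignore.case = TRUE)
  if (length(hits) >= 1) {
    df[hits, , drop = FALSE]  # one hit: done; several hits: prompt on these
  } else {
    df                        # assumption: no direct hit, keep all candidates
  }
}

# Illustrative data only: the ids are made up, not real TSNs
candidates <- data.frame(
  tsn = c("1", "2", "3", "4"),
  target = c("Poa annua", "Poa annua var. aquatica",
             "Poa pratensis", "Poa trivialis"),
  stringsAsFactors = FALSE
)

exact <- direct_match(candidates, "Poa annua")  # a single row, so no prompt
loose <- direct_match(candidates, "Poa")        # no exact hit, all rows kept
```

Note that a query containing regex metacharacters would need escaping before being pasted into the pattern; that is left out here for brevity.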

sckott, Jan 15 '15 22:01

@zachary-foster for your second comment:

Hard to say how common multiple names are. It also depends on how the data sources structure their queries on the server side: some may use a fuzzier search, and some more of a direct-match search. I don't think I've tried implementing this yet, so I'm not sure how hard it would be, but worth a try?

sckott, Jan 15 '15 22:01

@sckott Ok, I understand now. Thanks for the explanation.

I think it's worth a try. I don't know if you meant "recursively" literally, but a while (nrow(tsn_df) > 1) {...} loop around the current user prompt code seems like it would work. In the case of get_tsn, maybe something like (untested code):

if (ask) {
  names(tsn_df)[grep(searchtype, names(tsn_df))] <- "target"
  tsn_df <- tsn_df[order(tsn_df$target), ]
  rownames(tsn_df) <- seq_len(nrow(tsn_df))
  while (nrow(tsn_df) > 1) {
    message("\n\n")
    print(tsn_df)
    message("\nMore than one TSN found for taxon '", x, "'!\n\n",
            "Enter rownumber of taxon (other inputs will return 'NA'):\n")
    take <- scan(n = 1, quiet = TRUE, what = "raw")
    if (length(take) == 0) {
      # empty input: stop prompting
      tsn <- NA
      att <- "nothing chosen"
      break
    }
    if (take %in% seq_len(nrow(tsn_df))) {
      # row number: take that row and leave the loop
      take <- as.numeric(take)
      message("Input accepted, took taxon '", as.character(tsn_df$target[take]),
              "'.\n")
      tsn <- tsn_df$tsn[take]
      att <- "found"
      break
    } else if (any(grepl(take, tsn_df$target))) {
      # regex: narrow the candidates; the loop prompts again if > 1 remain
      tsn_df <- tsn_df[grepl(take, tsn_df$target), ]
      rownames(tsn_df) <- seq_len(nrow(tsn_df))
      tsn <- tsn_df$tsn
      att <- "found"
    } else {
      # no match: give up instead of prompting forever
      tsn <- NA
      mssg(verbose, "\nReturned 'NA'!\n\n")
      att <- "not found"
      break
    }
  }
} else {
  tsn <- NA
  att <- "NA due to ask=FALSE"
}

If you are worried about the possibility of infinite loops caused by while, maybe a for (i in seq_len(max_prompts)) with an if (nrow(tsn_df) == 1) break.
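A bounded version of the loop could be sketched roughly as follows. `narrow_down()` and `max_prompts` are hypothetical names, and the `inputs` vector stands in for what the user would type at each `scan()` prompt so the example can run non-interactively. The data frame again uses illustrative names and made-up ids.

```r
# Hypothetical sketch: cap the number of prompts so the loop always terminates
narrow_down <- function(df, inputs, max_prompts = 5, column = "target") {
  for (i in seq_len(min(max_prompts, length(inputs)))) {
    if (nrow(df) == 1) break            # a single candidate left: done
    keep <- grepl(inputs[i], df[[column]])
    if (!any(keep)) break               # nothing matched: stop, keep current rows
    df <- df[keep, , drop = FALSE]
  }
  rownames(df) <- NULL
  df
}

# Illustrative data only: the ids are made up, not real TSNs
candidates <- data.frame(
  tsn = c("1", "2", "3", "4"),
  target = c("Poa annua", "Poa annua var. aquatica",
             "Poa pratensis", "Poa trivialis"),
  stringsAsFactors = FALSE
)

typed <- c("Poa", "ann", "^Poa annua$")  # simulated user input
full <- narrow_down(candidates, typed)                   # narrows to one row
capped <- narrow_down(candidates, typed, max_prompts = 2)  # stops after two prompts
```

With the cap in place the caller still has to handle the case where more than one row remains, for example by returning NA as the current code does for unrecognized input.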

zachary-foster, Jan 16 '15 00:01

@zachary-foster Right, a while loop seems appropriate.

sckott, Jan 16 '15 19:01