OZtree
Search Relevance
It’s sometimes hard to tell which search result is the one you’re looking for, because very different nodes can contain the common-name term you searched for. I often find myself searching Wikipedia for a species name first and then coming back to OneZoom with that.
Examples of poor relevance:
- ‘Fox’ top result is something with the common name ‘fox’ but is a moth
- https://www.onezoom.org/life/@Macrothylacia_rubi=140039
- ‘Butterfly’ top result is a mussel
- https://www.onezoom.org/life/@Ellipsaria_lineolata=190876
- ‘Zebra’ top results are a fish and an eel
- https://www.onezoom.org/life/@Branchiostegus_semifasciatus=560744
- ‘Lavender’ top result is sea lavender
- ‘Dog’ top result is Raccoon Dog
- ‘Chicken’ top result takes you to a plant
Solutions may mean improving the ordering of search results (perhaps taking into account popularity), or making it easier to tell whether a result is the one you intended (perhaps including pictures or major group icons in the search list)
Useful examples, thanks. Popularity is definitely something we could use here (and I thought we did, actually, but clearly not, or not well enough)
I was fairly sure popularity figured in the search results as well, FWIW.
Maybe worth digging into those particular examples to see where we are going wrong?
I'd swear that this used to give more reasonable ordering, but it really doesn't seem to take popularity into account now.
Comparing the arctic fox (Vulpes lagopus, ott=775766) with the fox moth (Macrothylacia rubi, ott=140039):
https://www.onezoom.org/popularity/list?key=0&otts=775766,140039
"data": [
[775766, 260797.77, 318],
[140039, 148282.78, 618482]
]
So the arctic fox is far more popular, but you need to scroll like 4 pages down in the search results to find it.
I just noticed that we incorrectly have 'Fox' as the vernacular name for the 'Fox moth', so maybe being an exact match gives it an edge.
But it still doesn't explain why the arctic fox is far below many less popular taxa.
Some rough notes on how search ordering works at the moment:
- Popularity is not used for ordering search results in the tree of life explorer, though the search_nodes API does return popularity data
- The API does have an option to order by popularity, but it's not in use by the frontend
- The API uses MySQL full_text matching in boolean mode, so no relevance ordering is introduced there
- The frontend receives a list of matching nodes and leaves and computes 'overall_search_score's for each based on the type of text match found.
- Vernacular match is preferred over latin match
- Vernacular match is very strongly preferred over 'Extra vernacular' matches
- full string matches are preferred over partial matches
- partial matches at the end of the string are preferred over matches at the start
- etc.
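To make the rules above concrete, here is a minimal Python sketch of that kind of scoring. It is not the actual frontend code: the function names mirror the ones mentioned in the notes, but the signatures and every numeric weight are assumptions chosen only so that the stated preferences hold.

```python
def match_score(name: str, term: str) -> float:
    """Hypothetical string-match score (higher is better)."""
    name_l, term_l = name.lower(), term.lower()
    if name_l == term_l:
        return 1.0  # full string match
    if name_l.endswith(" " + term_l):
        return 0.8  # whole word at the end of the name
    if name_l.startswith(term_l + " "):
        return 0.6  # whole word at the start of the name
    if term_l in name_l:
        return 0.3  # any other partial match
    return 0.0


def overall_search_score(vernacular, extra_vernaculars, latin, term):
    """Combine match types; the multipliers are illustrative guesses."""
    vernacular_score = match_score(vernacular or "", term)
    # Latin matches rank slightly below vernacular matches.
    latin_score = 0.9 * match_score(latin or "", term)
    # 'Extra vernacular' matches are penalised very strongly.
    extra_score = 0.2 * max(
        (match_score(name, term) for name in extra_vernaculars), default=0.0
    )
    return max(vernacular_score, latin_score, extra_score)
```

With weights like these, a moth whose vernacular is exactly 'Fox' gets the maximum score, while a taxon that only has 'chicken' as an extra vernacular scores low despite the exact match, which is the pattern seen in the examples.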
Thanks for the analysis!
I think that taking popularity into account is going to be necessary here. Otherwise, we'll never get away from 'fox' being a moth, since it's an exact vernacular match. Whereas no actual species of fox is just called 'fox'. Of course, we should fix that moth's vernacular to be 'fox moth', but there are probably many cases of inaccurate vernaculars (e.g. same story for 'butterfly').
Fox case
What's curious about the results?
- The top result is a moth
- Because the vernacular is an exact match, there is no competition as far as the existing code is concerned.
- The popularity of this moth is 85,295
- The popularity of 'Foxes' is 165,818
- Flying foxes are also very high on the list here
- The name 'fox' occurs after a space and at the end of the species name, so they get a pretty good score from match_score
- Flying foxes in general have a popularity of 130,550
What's good about the results?
- 'Foxes' (http://localhost:8000/life/@_ozid=885806) is 2nd in this list and is a very relevant result
- Pluralised exact matches (which this is) are not scored quite as highly as exact matches without modification. There is perhaps an argument that they should be scored just as highly.
And here is the list ordered by popularity instead:
Chicken case
What's curious about the results?
- A plant is the top result
- The vernacular is correct
- Pluralised version of the search term is at the end of the vernacular, so scores fairly highly
- Popularity of this is 46,608
- Gallus gallus gets 129,833
- Gamebirds gets 150,625
- Red Junglefowl is very far down the list
- 'chicken' is only an extra vernacular for this, so ranks very low despite being an exact match
- Gamebirds is also very far down the list, for a similar reason: 'chicken-like birds' is only an extra vernacular and 'chicken' only appears at the start of it and not surrounded by spaces
What's good about these results?
- Hmm...
And here is the list ordered by popularity:
Lavender Case
What's curious about these results?
- Top result is a 'Sea Lavender'
- Exact match with search term at the end of the vernacular scores pretty highly
- This particular one has popularity 42,869
- English lavender has 49,220
- French lavender has 48,295
- No sea lavender has greater than 45,000
- There are several 'Sea Lavenders'
- From a cursory Google, 'sea lavender' does seem to be a fair vernacular for many of them
What's good about these results?
- There are many Lavandula species in the top results!
This is the list ordered naively by popularity
- The Black-tailed Waxbill at the top of the list has a much higher popularity: 116,535
Good triaging, thanks. I agree with your assessment. Any suggested fixes would be good.
Yeah it's tricky. Might not be tremendous quick wins here, and probably should have at least some small, manually-chosen test data to get some rough search relevance metrics (e.g. at least Mean Reciprocal Rank) when changing this.
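For reference, Mean Reciprocal Rank is simple to compute once there is a small hand-labelled set of queries. A minimal sketch in Python, where the function name, the data shapes, and the example result lists are all made up for illustration:

```python
def mean_reciprocal_rank(results_by_query, expected_by_query):
    """Average of 1/rank of the first relevant result for each query.

    results_by_query: query -> ranked list of result names
    expected_by_query: query -> set of names judged relevant (hand-chosen)
    A query with no relevant result in its list contributes 0.
    """
    if not expected_by_query:
        raise ValueError("need at least one labelled query")
    total = 0.0
    for query, expected in expected_by_query.items():
        ranked = results_by_query.get(query, [])
        for position, name in enumerate(ranked, start=1):
            if name in expected:
                total += 1.0 / position
                break
    return total / len(expected_by_query)


# Illustrative labels and result lists, not real search output.
expected = {"fox": {"Foxes"}, "chicken": {"Red junglefowl"}}
results = {
    "fox": ["Fox moth", "Foxes", "Flying foxes"],
    "chicken": ["Plant A", "Plant B", "Plant C", "Red junglefowl"],
}
mrr = mean_reciprocal_rank(results, expected)  # (1/2 + 1/4) / 2
```

Running the same labelled queries before and after a ranking change gives a rough number to compare, rather than eyeballing result lists.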
One thought would be to keep the match_score approach, where full matches score better than partial, match at the end scores better than match at the start, etc. with a couple of modifications:
- allow punctuation after/before the match (e.g. "Grey wolf (and domestic dog)" should still count as matching 'dog' at the end and score highly for that)
- when checking that a word is delimited rather than being part of a larger word, should allow punctuation as delimiters (e.g. "Chicken-like" should score as highly for 'chicken' as 'chicken like')
and then from there:
- no longer downgrade extra vernaculars so much, if at all
- no longer downgrade plural matches
- sort first by the string match score, and then by popularity. That is: stronger string match takes precedence over higher popularity, but popularity is involved.
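Roughly, the modifications listed above could look like the following Python sketch. The function names, the 0 to 3 score levels, and the result shape (a dict with "names" and "popularity") are assumptions for illustration, and the plant name is a placeholder because its real vernacular isn't quoted in this thread.

```python
import re


def word_match_score(name, term):
    """3 = full match, 2 = word at end, 1 = word elsewhere, 0 = no match.

    Punctuation counts as a word delimiter and plurals are not downgraded.
    """
    # Ignore punctuation wrapped around the whole name, e.g. a closing ")".
    core = re.sub(r"^\W+|\W+$", "", name.lower())
    word = re.escape(term.lower().strip()) + r"(?:s|es)?"
    if re.fullmatch(word, core):
        return 3
    # The term must not be glued to other letters or digits on either side.
    hits = list(re.finditer(r"(?<!\w)" + word + r"(?!\w)", core))
    if not hits:
        return 0
    return 2 if hits[-1].end() == len(core) else 1


def rank_results(results, term):
    """Sort by best string match first, then by popularity (both descending)."""
    scored = []
    for result in results:
        # Extra vernaculars are treated the same as the main vernacular here.
        best = max((word_match_score(n, term) for n in result["names"]), default=0)
        if best > 0:
            scored.append((best, result))
    scored.sort(key=lambda pair: (-pair[0], -pair[1]["popularity"]))
    return [result for _, result in scored]


# Popularity figures are the ones quoted earlier in this thread.
candidates = [
    {"label": "plant", "names": ["Placeholder plant chickens"], "popularity": 46608},
    {"label": "Gamebirds", "names": ["Gamebirds", "Chicken-like birds"], "popularity": 150625},
    {"label": "Red junglefowl", "names": ["Red junglefowl", "chicken"], "popularity": 129833},
]
ordered = [r["label"] for r in rank_results(candidates, "chicken")]
```

Because the string match score is the primary sort key, popularity only breaks ties between equally strong matches, so a bad exact-match vernacular would still win under this scheme.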
Taking a rough swing at that (without any objective metrics), here's what the results look like for some of the cases above:
Notes about that:
- Top result for lavender is a crayfish
- Top result for 'zebra' is horses in general, which isn't ideal, I guess horses are popular.
- butterfly results are rubbish
- 2nd result for 'cat' is a fish
but on the bright side:
- Red junglefowl is then the top result for chicken
- Grey wolf is then the top result for dog
But overall not very impressive results. I think maybe we won't get much further without some form of rating of vernaculars.
There are definitely some suspicious vernaculars floating around:
- I can't find a source that labels this crayfish as 'Broad-leaved lavender' http://localhost:8000/life/@Cambarus_laconensis=1083969
- Our DB has the source as EOL but I suppose EOL has since removed all names for this https://eol.org/pages/12017821/names
- Nearby, this similarly doesn't look like it's supposed to be labelled 'Fragrant evening primrose' http://localhost:8000/life/@Cambarus_hamulatus=1013834
- Again, database src is listed as eol (src=30) and eol no longer has that name for this species https://eol.org/pages/344552/names
Maybe a side issue, but searching for 'Broad-leaved lavender' doesn't find anything at all. I think there may be a general issue with searching for names that contain a dash. But we should open a separate issue for that.
I do think that popularity needs to be part of the formula. I have a very simple branch I created a while ago that integrates popularity: https://github.com/wolfmanstout/OZtree/tree/popularity_ranking
In a nutshell, I adjust the search ranking using a popularity factor of 0.5 to 1.5 for each species based on its popularity rank, so that the average adjustment is 1 when averaged over all species. My thinking was to try to avoid bias towards species, but it doesn't really work because half the species do get a bump over the higher taxa, so species will still dominate the top K results for any small K. I don't think it would really fix it to apply an average popularity to taxa either, because any large taxa would be dominated by relatively unpopular species.
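As a rough illustration of that kind of adjustment (not the code in the branch), here is a Python sketch that assumes a linear mapping from popularity rank to a factor between 0.5 and 1.5. The function names and the linear form are assumptions.

```python
def popularity_factor(rank, total):
    """Map a popularity rank to a multiplier in [0.5, 1.5].

    rank 1 is the most popular species and rank == total the least popular.
    The mapping is linear, so the factor averages to 1 over all ranks.
    """
    if total <= 1:
        return 1.0
    return 1.5 - (rank - 1) / (total - 1)


def adjusted_score(match_score, rank, total):
    """Scale a string-match score by the popularity factor."""
    return match_score * popularity_factor(rank, total)


factors = [popularity_factor(rank, 5) for rank in range(1, 6)]
# factors is [1.5, 1.25, 1.0, 0.75, 0.5], which averages to 1.0
```

This also shows the problem described above: every species in the top half of the ranking gets a factor above 1, so if higher taxa are left at a neutral 1.0 they lose out to any moderately popular species with a similar string match.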
Maybe the right approach, as I think James Rosindell may have been alluding to this morning, is to sum the popularity of species for any higher taxa. I'm not sure offhand whether a simple sum would work, or whether we would need a more sophisticated formula that sums the inputs to popularity (I would have to take a closer look at how popularity is calculated, e.g. to avoid double-counting).
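A minimal Python sketch of the "simple sum" option, to make the idea concrete. The tree representation (a dict from parent to children) and the function name are assumptions, the two-species tree is a toy, and this does nothing about the double-counting concern.

```python
def summed_popularity(root, children, leaf_popularity):
    """Return node -> popularity, where each higher taxon gets the sum
    of the popularity of all species (leaves) beneath it.

    children: parent -> list of child nodes (leaves have no entry)
    leaf_popularity: species -> popularity score
    Uses an explicit stack so very deep trees don't hit recursion limits.
    """
    totals = {}
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            totals[node] = leaf_popularity.get(node, 0.0)
        elif children_done:
            totals[node] = sum(totals[kid] for kid in kids)
        else:
            # Revisit this node after all of its children are totalled.
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
    return totals


# Toy tree using the two lavender popularity values quoted above.
children = {"Lavandula": ["English lavender", "French lavender"]}
leaf_popularity = {"English lavender": 49220, "French lavender": 48295}
totals = summed_popularity("Lavandula", children, leaf_popularity)
```

With a plain sum, large clades would score far above any single species, so some normalisation would probably be needed before mixing these values into the same ranking.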