Search for a package with its verbatim name still does not display the package in first position
When I search for flutter_animate, a well-maintained package with high stats, the default ("Sort by search relevance") search result does not display the package in first position.
Instead, it is shown somewhere down below, currently at 8th position, below a totally unrelated package named map_picker.
The expected result is:
- flutter_animate in first position
- followed by other packages that are related, e.g. with similar names.
This is a long-standing issue with pub.dev that affects numerous packages.
@crcdng Thanks for reporting it!
@jonasfj @sigurdm:
This seems to be a tough case, because for the given search query:
- it is not the most liked https://pub.dev/packages?q=flutter_animate&sort=like
- it is not the most downloaded https://pub.dev/packages?q=flutter_animate&sort=downloads
- it does not have maximum points (150/160) while other packages have that
I'm 95% certain that if those static analysis points were fixed, it would be on the first few spots, likely on the first.
I'm inclined to say that the points in this case are causing a bit more divide in the ranking than warranted. Maybe we should dampen the difference somehow?
To me it seems fair that the package is not at the top, given that it is not the most popular by downloads nor the most liked nor has the most analysis points. This looks like it is working as intended.
We do show the exact package name match at the top:
It's funny - I did not see the "Matching package names..." until mentioned.
I would still argue that a package repository should always display the (one) direct match in the first place regardless of metrics.
I think this was once the actual behavior, and was decided against. There can be abandoned packages that happen to have squatted some keyword. They should not be shown at the top of the list IMO.
We show direct matches, and we show a ranked list of matches.
I believe having the match in the package name ranks the package slightly higher (@isoos please confirm) but in this case it seems to be not enough.
I agree with @isoos that if the package score was fixed this package would most likely rank much higher.
Right now it ranks the same as if the description or the topics had the same keywords, so not entirely higher. (Similar to the exact matching, this was higher before, but other search examples prompted us to lower the match significance).
Imho it's worth comparing to npm which despite having lower number of downloads includes the exactly matched library at the very top with a tag exact match
@orestesgaolin: I'd like to emphasize again: we actually had it in the first position as part of the regular results, but then we got complaints that the package was no longer relevant for the query, despite the name match. Hence the current compromise of exposing name matches separately. It is not clear which solution is really better.
Maybe introduce some threshold that decides whether the exact match is still relevant? E.g. if it got an update in the last x months, got liked n times in the last n months or something like this. So we could skip the name squats but otherwise put the exact match at the top, in a manner that we all scan for, so a normal row with package info, not a small paragraph with a small link :)
@Albert221: We are not removing the package from the result list. If it is relevant, it will be listed in (one of) the top position(s), and at the moment we don't have yet-another-relevancy score to up-or-downrank special cases.
FWIW, this has tripped me up several times as a user. I overlook the "Matching package names" line and then only see the one or two leading packages being different from what I'm searching for. I almost had a heart attack when this happened the last time — I was searching for package:xml and got this:
https://pub.dev/packages?q=xml
If you're not trained to look for the exact match line, you can easily think that the package has been discontinued or removed or something.
Only after remembering that this has happened before and scrolling down did I find the actual package.
I think we could rephrase this problem and ask for more explicit presentation of directly matched package. Perhaps card-like style similar to all the remaining entries could help with discoverability. This way we could avoid tweaking the search relevance algorithm and just apply relatively straightforward UI change.
Despite using pub.dev daily, I get confused all the time. It could make an interesting case in ux study ;)
exactly, the current design creates a perceptual trap similar to this one:
One obviously needs to take into account packages that don't "deserve" to be listed first for a given search term. For example, if there's a Foo Database and, before it gets its own quality, official API (pkg:foo_database), someone else releases an inferior pkg:foo, you don't want the exact match to be forever number one in the search results.
So this has no simple solution. But I'd propose tweaking the weight of exact match, or the formula that combines the different signals together. You generally want high quality, popular packages with exact match (like pkg:xml) appear first even when there are packages with higher scores that also match the query.
@jonasfj you looked at the top queries, and concluded that most search on pub.dev indeed is for a known package. Did I get this right?
If that is indeed true, then I agree, we should probably do either or both of
- styling the "exact match" better to make it more obvious
- giving higher ranking for a close-name-match
I want to emphasize that no other package repository I have interacted with so far, be that pypi, npm, crates.io, lib.rs or any Linux Distro GUI package manager frontend has had this problem.
Neither the bad sorting, nor the weird ux where the exact match is hidden in a small text.
Here's an example for weird ordering
And here's an example for way too aggressive fuzzy search
It is a different tradeoff I guess. If I search for 'xml' on crates, I indeed get the 'xml' crate as the first result. But judging from the download counts I would probably rather be using 'fast-xml', and thus I think that package should be listed higher.
I still think showing the exact match as a separate thing is useful, but I agree we should style it to be more noticeable. https://github.com/dart-lang/pub-dev/pull/8573 is a start here, but I think we should do even more.
Another example "yaml":
Totally agree there's a trade off. (See my parable above about pkg:foo_database versus pkg:foo.) I don't think it's a good idea to always show the exact match at first place.
But I also think there's something fishy about the current formula. Because for the search term xml, you get this:
The exact match pkg:xml has 411 likes and 4 million downloads. It's number 3 in the search results. The two packages that are above it have a fraction of the likes and downloads, and they're not as relevant (gpx is for GPS data in XML form, xml2json does what it says on the tin; xml is for parsing and building XML). It almost looks like the few additional pub points have an outsized influence on the ranking?
If that's so, I suggest decreasing the effect of pub points. Sure, a package with 60/150 pub points should not be #1 in search results as long as there's almost any other package. But when packages reach some reasonable level, the pub points signal is much less important. A single info-level lint is clearly not something that should dethrone a million-downloads-per-week package that's used in 459 other packages.
We (really @isoos) are trying to make the difference caused by the last few pub points less in https://github.com/dart-lang/pub-dev/pull/8572 we'll see the effect when that is deployed. (I think this can be seen on staging already https://staging.pub.dev/packages?q=xml)
Comment: it's better, especially as the current result has something rather unrelated on top position.
I still wonder, how can a package that
- matches the search intention / search term exactly
- has more than 3x the likes
- approx. 20x the downloads
in other words, a vastly more significant search result, appear in second place merely because it scores 3% less on some rather complex and arbitrary "points" metric?
@crcdng: to answer the specifics:
- Exact name match has the same score as if the string was found in the package description or its topics (this is to de-emphasize the importance of the names so people won't fight over the "best" names).
- The likes and download counts are not scored linearly: 100M downloads should not be worth 100x more than 1M downloads or 1000x more than 100k downloads. Instead, we sort the packages in increasing order and assign scores from 0.0 to 1.0 linearly along that order: the package with the fewest downloads gets 0.0, the median gets 0.5, and the most downloaded gets 1.0. The same applies to likes.
We then combine the like and download scores (50-50% right now) into a merged score, and then combine it with the pub points (which is 0.0-1.0 depending on the given points / max points). I've started tuning the latter, effectively compressing the high end of that range.
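To make the scoring described above concrete, here is a minimal Python sketch. It is an illustration only, not the pub.dev implementation: the function names are hypothetical, and the final step multiplies the two scores purely as an assumption, since the thread does not say how the merged score and the pub points are combined.

```python
from bisect import bisect_left


def rank_scores(values):
    """Map each value to 0.0-1.0 by its position in the sorted order.

    The smallest value gets 0.0 and the largest gets 1.0; with distinct
    values and an odd count, the median gets 0.5. Tied values share the
    lowest rank of their group (a simplification chosen for this sketch).
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    return [bisect_left(ordered, v) / (n - 1) for v in values]


def popularity_scores(likes, downloads):
    """Merge the like and download rank scores with equal (50-50) weights."""
    like_scores = rank_scores(likes)
    download_scores = rank_scores(downloads)
    return [0.5 * l + 0.5 * d for l, d in zip(like_scores, download_scores)]


def points_score(points, max_points):
    """Pub points as a 0.0-1.0 fraction of the maximum."""
    return points / max_points


def combined_score(popularity, points):
    # ASSUMPTION: multiplication is used here only for illustration; the
    # actual combination used by pub.dev is not described in this thread.
    return popularity * points


# Hypothetical numbers: three packages with very different download counts.
downloads = [100_000, 1_000_000, 100_000_000]
likes = [10, 400, 50]
print(rank_scores(downloads))
print(popularity_scores(likes, downloads))
print(combined_score(1.0, points_score(150, 160)))
```

Because the scores depend only on rank order, a package with 100x the downloads of its neighbour gains no more than a package that is barely ahead, which is why a few missing pub points can outweigh a large popularity gap.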
It is important to note that while you may find a few compelling queries to blindly promote the exactly matching package to the top, we have quite a lot of examples where this is not ideal at all. It will be always a balance.
Thanks for the explanation. I understand this is a complex task. I am mostly judging from the current results (I encountered really many examples before filing the bug report) and I think there will be improvement. I'm still not convinced the "weights" will be calibrated correctly.
To reiterate, when I put "xml" in the pub.dev search field, my intention is to find out if / what kind of xml library or libraries are there for Flutter / Dart. Or I might have read about or heard of a Flutter "xml" package. Now I want to check it out.
In particular, I am NOT searching for a xml to json converter package (I would have entered something like "xml json") and I am clearly not searching for a package to "load, manipulate, and save GPS data in GPX format" a package that happens to be based on XML. Here my search would have mentioned something GPS or geodata related.
And the fact that on some metric the package I am actually looking for ranks lower than the two that I am not looking for, but are returned above it doesn't matter, because I am not looking for these other two packages, as explained above. Therefore the current metric comparison between these packages really makes no sense. Taken to the extreme, your strategy could be to always return the highest metric package regardless the search term in order to ensure good quality results.
I think putting such high weights on the "points" metric makes sense when we don't have an exact match.
Now your point is, if I'm correct, that someone would publish a package called "a" that does "b" and that therefore should not be the first result looking for "a". It makes sense when the ranking punishes that behaviour. But if that treacherous package would have proper points, it would still be top-ranked, so clearly this is not a solution to the problem. In the current example "xml" is missing 10 points on static analysis, which is not an indicator of a match between package name and package purpose at all.
Exact name match has the same score as if the string was found in the package description or its topics (this is to de-emphasize the importance of the names so people won't fight over the "best" names).
I wonder: is the idea of tf-idf (Term Frequency-Inverse Document Frequency, or "term specificity") applied to the pubdev ranking algorithm? If not, that might be the actual problem. I'm no expert but from what I understand, tf-idf exists to de-emphasize documents such as "xml 2 json" or "GPS XML" when what you're looking for is "xml".
I think we all agree that just blindly showing exact matches for every query would just lead to more name-squatting and worse results overall. But yeah, flutter_animate and xml probably deserve to be number ones for their exact-match queries.
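As a rough illustration of the tf-idf idea raised above, here is a small Python sketch of a textbook-style tf-idf score for a single query term. The package names and descriptions are hypothetical, the whitespace tokenizer is deliberately naive, and nothing here reflects how pub.dev actually ranks results.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document for one query term using a basic tf-idf.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(N / df) + 1, where df is the number of documents containing
          the term (the +1 is a common smoothing choice, assumed here)
    """
    term = query.lower()
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = sum(1 for tokens in tokenized.values() if term in tokens)
    if doc_freq == 0:
        return {name: 0.0 for name in docs}
    idf = math.log(len(tokenized) / doc_freq) + 1.0
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
        scores[name] = tf * idf
    return scores


# Hypothetical package descriptions, not real pub.dev data.
docs = {
    "xml_parser": "xml parsing and building xml documents",
    "gps_tracks": "load manipulate and save gps data stored as xml",
    "http_client": "simple http client",
}
for name, score in sorted(tf_idf_scores("xml", docs).items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

For a one-word query the idf factor is the same for every document, so the ordering comes from the length-normalized term frequency: a description that is mostly about the term ranks above one that mentions it in passing.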
Now your point is, if I'm correct, that someone would publish a package called "a" that does "b" and that therefore should not be the first result looking for "a". It makes sense when the ranking punishes that behaviour. But if that treacherous package would have proper points, it would still be top-ranked, so clearly this is not a solution to the problem.
Such nefarious package wouldn't be top ranked without other indicators, e.g. likes or downloads. Of course one could game that with enough dedication, but what's the point if the package otherwise is crap? It is not that one can get massive benefits by gaming search like that.
In the current example "xml" is missing 10 points on static analysis, which is not an indicator of a match between package name and package purpose at all.
To be fair, there is always some ranking that will result in similar differences: e.g. a new fork of an abandoned package may be fixing all the obsolete stuff and errors, and if you want to search for its name or features, you may want to get the new one with more points but less downloads to be ranked higher. Unfortunately there is no single metric that gets every ranking just right.
Aside: the package:xml case is also interesting, because the author already fixed the linter issue 2 months ago; they just haven't released it as part of a new version (or made any release in 15 months):
https://github.com/renggli/dart-xml/commit/ac5cd8c235f3dfd54b44b4af4513b2dbdb6c6a87
is the idea of tf-idf (Term Frequency-Inverse Document Frequency, or "term specificity") applied to the pubdev ranking algorithm?
Yes and no: we had something like it, but over the years it got diluted/compressed/changed, and I wouldn't call it tf-idf anymore.
The reasons are the usual: over the years we had similar discussions like this thread, prompting us to change/tune the ranking algorithm in use, and it drifted to its current state. The reluctance to swiftly change the ranking algorithm comes from this experience: we have seen very different ranking preferences and requests already, and did our best to fix the algorithm to accommodate the goals of the users. We intend to do the same here, but the solution may be different than what comes to mind in the first place.
@isoos @sigurdm I looked at #8573 and how it appears in the staging website
I think the problem people are having in this thread is that the exact match result looks a lot more like a "Did you mean ___?" prompt than a package result at first glance. It's just a linked word and doesn't have any of the package details like the other results do. I strongly think following npm is the right move here: they still include the entire package information, but explicitly note that it's at the top because it's an exact match (copying the screenshot from earlier in the thread).
This has been a UX issue for me as well. I never noticed the exact matches so I'd resort to google to find the package.
I think improving the UI for the exact matches is all that's needed for a fix. If that's more clearly called out in the UI then I don't think this is really an Issue.
@sigurdm Doesn't always seem to work for exact match, searched for drops package and couldn't find it in the search result nor in the exact match Drops: https://pub.dev/packages/drops
@hamza-imran75: Thanks for reporting this, as this surfaces a bug where we don't lowercase the search expression for exact package name matching.
However, we still won't move the exact name matches to the first hit spot, e.g. in this case package:drops is a relatively young package (less than a month old), it has not too many likes, not too many downloads. Once those catch up, it will move up in ranking position. (Also needs to fix the scores too.) (Aside: it starts with a typo in both the description and the readme: pacakge, but this shouldn't influence its general ranking).
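The lowercasing bug mentioned above can be sketched as follows. This is a hypothetical helper, not pub.dev code, and it assumes that package names are stored in lowercase, so only the query needs normalizing.

```python
def exact_name_matches(query, package_names):
    """Return the package names that equal the query, ignoring query case.

    ASSUMPTION: package_names are already lowercase; only the query is
    normalized (trimmed and lowercased) before comparison.
    """
    normalized = query.strip().lower()
    return [name for name in package_names if name == normalized]


# Hypothetical index contents for illustration.
names = ["drops", "drop_shadow", "xml"]
print(exact_name_matches("Drops", names))
```

Without the `lower()` call, a query typed as "Drops" would fail to match the stored name `drops`, which is the behaviour reported above.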
While I agree that a search engine should surface good packages that are worth using, I'd still argue that a very basic requirement for a search engine is that it finds what you're looking for. Obviously, searching "database" and expecting to find cloud_firestore can be considered a goal, and searching through the description and surfacing high-ranking packages can be a means to that end, but that's a case where the user is relying on the search engine to do the hard part of choosing a good database for them.
In the much simpler case where a user already knows exactly what package they're looking for, it doesn't make much sense (at least, to me) for the engine to basically ignore that result completely. Packages don't start at high popularity, so this can negatively affect packages that are trying to gain users -- it must be hard to advertise if Pub barely even shows your package.
In any case, I find it strange that using the search bar to find a package you know the exact name to can take more time than just hand-typing in a URL like https://pub.dev/packages/package_name yourself. When I'm looking for changelogs or API docs, I always use URLs as I know it'll get me where I want, whereas I can't say the same for the search bar
@Levi-Lesches: I'm curious what you think about the following example:
There is a package without any content called mysql, and searching for mysql won't promote it to the first package spot, only to the exact package name matches. If we were to always promote exact name matches, it would certainly downgrade the user experience for this query. If we were to come up with arbitrary rules about when not to promote to the first spot, it would cause inconsistency and possible frustration in certain cases.