
add keywords to docs for better search results

Open Rich-Harris opened this issue 2 years ago • 41 comments

Describe the problem

To use @tcc-sejohnson's example: if you search for adapter-static in the docs, the page you're probably looking for — this one — is the fifth result:

image

Describe the proposed solution

I think the easiest and most reliable solution would be to add keywords frontmatter to the relevant markdown files, so that if you match one of them (or a keyword starts with your search term) that document is treated as higher priority than all others.

We could indicate the keyword in the UI somehow but I don't think it's necessary.
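As a rough illustration of the keyword idea above, here is a minimal sketch. It assumes each document record carries a `keywords` array taken from its frontmatter; the record shape and the `matchesKeyword` and `prioritize` helpers are hypothetical and not part of the existing search code.

```javascript
// Hypothetical document records; `keywords` would come from markdown frontmatter.
const docs = [
  { href: '/docs/configuration', title: 'Configuration', keywords: [] },
  { href: '/docs/adapter-static', title: 'Static site generation', keywords: ['adapter-static', 'ssg'] }
];

// True if any keyword equals the query or starts with it.
function matchesKeyword(doc, query) {
  const q = query.trim().toLowerCase();
  if (q === '') return false;
  return doc.keywords.some((keyword) => keyword.toLowerCase().startsWith(q));
}

// Stable partition: keyword matches first, otherwise keep the engine's order.
function prioritize(results, query) {
  const boosted = results.filter((doc) => matchesKeyword(doc, query));
  const rest = results.filter((doc) => !matchesKeyword(doc, query));
  return [...boosted, ...rest];
}

console.log(prioritize(docs, 'adapter-st').map((doc) => doc.href));
```

Because the partition is stable, documents without a keyword match keep whatever order the underlying search engine gave them.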

Alternatives considered

No response

Importance

nice to have

Additional Information

No response

Rich-Harris avatar Jan 25 '23 01:01 Rich-Harris

3 of the 4 documents that rank above it don't contain adapter-static a single time. It must be tokenizing it into "adapter" and "static". Perhaps we can either remove - as a delimiter character or special case the adapter names to be treated as a single word
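To make the suspected tokenization effect concrete, here is a small sketch comparing two tokenizers. These functions are illustrative assumptions only; they are not flexsearch's actual tokenizer.

```javascript
// Splits on whitespace and hyphens, so "adapter-static" becomes two tokens.
function tokenizeSplittingHyphens(text) {
  return text.toLowerCase().split(/[\s-]+/).filter(Boolean);
}

// Splits on whitespace only, so hyphenated names survive as a single token.
function tokenizeKeepingHyphens(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

console.log(tokenizeSplittingHyphens('Use adapter-static')); // [ 'use', 'adapter', 'static' ]
console.log(tokenizeKeepingHyphens('Use adapter-static'));   // [ 'use', 'adapter-static' ]
```

With the first tokenizer, any page mentioning both "adapter" and "static" can match the query, even if it never contains adapter-static.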

benmccann avatar Jan 25 '23 03:01 benmccann

I think there's another bug as well:

https://github.com/sveltejs/kit/blob/f953c9d810be8b9211ce1fa456d9c96224ec55dc/sites/kit.svelte.dev/src/lib/search/search.js#L64

The problem is that sub-sections rank lower than main pages.

  • https://kit.svelte.dev/docs/adapter-static#usage is automatically pushed to the bottom because it has a #
  • https://kit.svelte.dev/docs/configuration jumps to the top because it has no #, despite not even containing the text adapter-static

It should probably be grouping followed by ranking. I.e. we group by the page and then rank based on the highest ranking sub-section or something like that.
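A minimal sketch of "grouping followed by ranking", assuming each hit has an `href` and a numeric `score` where higher is better. The hit shape, the scores, and the `groupThenRank` helper are hypothetical and do not reflect the current search.js.

```javascript
// Hypothetical scored hits; higher score means more relevant.
const hits = [
  { href: '/docs/configuration', score: 2 },
  { href: '/docs/adapter-static#usage', score: 9 },
  { href: '/docs/adapter-static', score: 5 },
  { href: '/docs/configuration#adapter', score: 4 }
];

// Group hits by page (the part of href before '#'), then order pages by their best hit.
function groupThenRank(results) {
  const pages = new Map();
  for (const hit of results) {
    const page = hit.href.split('#')[0];
    if (!pages.has(page)) pages.set(page, { page, best: -Infinity, hits: [] });
    const group = pages.get(page);
    group.hits.push(hit);
    group.best = Math.max(group.best, hit.score);
  }
  const groups = [...pages.values()];
  // Within each page, show the strongest sub-section first.
  for (const group of groups) group.hits.sort((a, b) => b.score - a.score);
  return groups.sort((a, b) => b.best - a.best);
}

console.log(groupThenRank(hits).map((group) => group.page));
```

Here a strong sub-section hit lifts its whole page, rather than being demoted for containing a #.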

benmccann avatar Jan 25 '23 06:01 benmccann

Might it be best to implement 3rd party search? Algolia is free for open-source, and does a great job of indexing and ranking...

https://www.algolia.com/for-open-source/
https://www.algolia.com/doc/tools/crawler/getting-started/overview/

Edit: Oh, not quite free: 200,000 search requests per month - still, maybe worth budgeting for.

oodavid avatar Jan 25 '23 14:01 oodavid

I've been meaning to write a blog post about this, but there's a variety of reasons we don't want to use third party search tools:

  • We value your privacy. While we don't have any particular animus or distrust towards Algolia, we don't want to be in the position of trusting any third party to handle your data responsibly, and that includes your search history
  • We don't want to cede control over the UI or the search results. While it's arguably true that Algolia will have better out-of-the-box results than our homegrown setup (which uses flexsearch), we have the ability to improve it and tailor it as we see fit, which we'd lose if we had something generic
  • Search should work without JavaScript, especially when the framework in question preaches progressive enhancement. Ours does (https://kit.svelte.dev/search?q=hello), every other framework's doesn't, because they use Algolia
  • If you use Algolia, every keystroke results in a network request. Depending on where you are relative to Algolia's servers, that will result in latency; regardless of where you are it will result in unnecessary data usage
  • By extension, search ceases to work if you lose connectivity. The SvelteKit docs don't currently work fully offline, but it's a medium term goal
  • It takes time to index a site. With our approach, search is 100% up to date for every deploy, even preview deploys. That's not true for any site that uses Algolia

Rich-Harris avatar Jan 25 '23 17:01 Rich-Harris

The only bullet point I'd comment on before you write this blog post is:

We don't want to cede control over the UI or the search results. While it's arguably true that Algolia will have better out-of-the-box results than our homegrown setup (which uses flexsearch), we have the ability to improve it and tailor it as we see fit, which we'd lose if we had something generic

Flexsearch is incredibly hard to customize relative to Algolia, Elastic, or just about any index I've used in the past. I've spent the morning trying and simply can't understand how Flexsearch's scoring works. I've filed a few issues in the Flexsearch repo asking for more details and hope to come back to this after getting some more details about how to tweak Flexsearch.

In the meantime, I've sent a PR which just does some housekeeping on our side: https://github.com/sveltejs/kit/pull/8727

benmccann avatar Jan 25 '23 18:01 benmccann

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

Rich-Harris avatar Jan 25 '23 19:01 Rich-Harris

A very well reasoned response. Personally I'd put results relevance above all of those points.

I've had some success in the past with Typesense, IIRC it has a rational approach to ranking and relevance. Might be worth a peek:

https://typesense.org/docs/guide/ranking-and-relevance.html

Flexsearch has a list of other libraries, benchmarked:

https://nextapps-de.github.io/flexsearch/bench/

oodavid avatar Jan 25 '23 21:01 oodavid

Typesense looks really cool @oodavid.

Could we try implementing it? I'd like to participate.

enBonnet avatar Feb 04 '23 17:02 enBonnet

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

LunrJs is also good and is flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

Hetarth02 avatar Feb 04 '23 19:02 Hetarth02

LunrJs is also good and is flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

There are a lot of options, we should put our focus on the problem we wanna resolve and look at which one is the best for doing it.

The current problem seems to be the priorities.

enBonnet avatar Feb 05 '23 20:02 enBonnet

I'm open to alternatives as I don't particularly like flexsearch, but it'd be nice to find one that allows us to keep the functionality that we have today. In particular today you can see results as you type and many of the tools mentioned above don't appear to support that. The search we use today also does not require any extra infrastructure. I'm not sure if any of the tools mentioned are great fits, but would love if someone can find one that fits the bill.

  • typesense - can do prefix-based search. appears to require running search server. is it going to require extra infrastructure or can we run as a serverless function on vercel? I see a next.js example but it uses typesense's cloud. It won't work offline in any case
  • lunr - uses bm25. unclear if you can search based off prefix
  • stork search - has markdown and frontmatter support. appears to index substrings, but not sure you can do prefixes. can boost titles, but otherwise not sure you can control ranking. wasm. isn't on npm. wants to manage the DOM by default, so need to use advanced search method to build your own interface
  • fuse.js - has weighted search. unclear if you can search based off prefix

benmccann avatar Feb 06 '23 19:02 benmccann

I am not biased towards lunrjs but I have been working with it currently and I think it checks off all your requirements.

  • [x] See search results as you type

You can build the indexes once on the initial search on the client side (not a good idea), or you can load a pre-built index file during search. For example, see the Julia docs, which build the indexes on the client side for the initial search.

  • [x] Does not require any extra infrastructure

One can use GitHub's CI/CD to pre-build the index file on every push. (I am actually working on this issue in Documenter.jl.)

@benmccann I didn't get the "search based off prefix" part. Can you please explain? (If possible, with a small example.)

Hetarth02 avatar Feb 07 '23 01:02 Hetarth02

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

benmccann avatar Feb 07 '23 02:02 benmccann

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

Correct me if I am wrong but are you perhaps talking about auto-complete?

Would this be something we are looking for?

Autocomplete library by Algolia

Hetarth02 avatar Feb 07 '23 03:02 Hetarth02

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.
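The prefix indexing described above can be sketched as follows. This is a simplified illustration, not how flexsearch stores its index internally; the `prefixes` and `buildPrefixIndex` helpers and the sample documents are made up for the example.

```javascript
// All prefixes of a word: "adapter" -> "a", "ad", ..., "adapter".
function prefixes(word) {
  return Array.from({ length: word.length }, (_, i) => word.slice(0, i + 1));
}

// Map every prefix of every word to the set of document ids containing that word.
function buildPrefixIndex(documents) {
  const index = new Map();
  for (const { id, text } of documents) {
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      for (const prefix of prefixes(word)) {
        if (!index.has(prefix)) index.set(prefix, new Set());
        index.get(prefix).add(id);
      }
    }
  }
  return index;
}

const prefixIndex = buildPrefixIndex([
  { id: 'adapters', text: 'adapter static' },
  { id: 'routing', text: 'advanced routing' }
]);

console.log(prefixes('adapter'));
console.log([...prefixIndex.get('ad')]);  // ids of both documents
console.log([...prefixIndex.get('ada')]); // only the 'adapters' document
```

Each word of length n adds n index entries, which is where the extra memory goes; in exchange, a partial query is a single map lookup.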

benmccann avatar Feb 07 '23 03:02 benmccann

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.

I see. Then perhaps this is what we are looking for:

Wildcards Lunrjs

I think this can reproduce the same functionality you are talking about.

Hetarth02 avatar Feb 07 '23 03:02 Hetarth02

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

benmccann avatar Feb 07 '23 03:02 benmccann

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

I can try to make a prototype. Can anyone guide me through some of the steps to setup the code for docs locally?

@benmccann @enBonnet

Hetarth02 avatar Feb 07 '23 03:02 Hetarth02

Can anyone guide me through some of the steps to setup the code for docs locally?

You'll need to have pnpm installed, then...

git clone git@github.com:sveltejs/kit
cd kit
pnpm install
cd sites/kit.svelte.dev
pnpm dev

...and you should be off to the races!

Rich-Harris avatar Feb 11 '23 17:02 Rich-Harris

One thing I'll note is that the web worker that powers our current search — which includes all of flexsearch plus our logic that sits around it — is 18kb of unminified code (though it probably should be minified, not sure why it isn't).

By contrast, lunr by itself weighs 99kb. Probably not a dealbreaker but something to be conscious of.

Rich-Harris avatar Feb 11 '23 17:02 Rich-Harris

I suspect you want to keep the search locally on the client, but if you're looking for an alternative to algolia there's meilisearch: https://docs.meilisearch.com - though 11kb minified+zipped

kevmodrome avatar Feb 13 '23 12:02 kevmodrome

lunr is only 29k minified, so it's not too bad. The thing that I just noticed that gives me more hesitation is that it appears to basically be abandoned. It hasn't been updated since 2020, it still uses Travis CI, there's a number of unreviewed PRs, etc. It'd be nice if we could find something that's a bit better maintained

benmccann avatar Feb 14 '23 00:02 benmccann

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr

benmccann avatar Feb 14 '23 01:02 benmccann

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr

Thanks for your suggestion, I will try to use this.

Hetarth02 avatar Feb 17 '23 02:02 Hetarth02

I was trying out minisearch and elasticlunr yesterday, @Hetarth02 you can continue from those branches if it saves you some setup time.

You'll need to be using Chrome for this btw since Firefox doesn't yet support module workers.

gtm-nayan avatar Feb 17 '23 03:02 gtm-nayan

@gtm-nayan Thanks for your help. By the way, did you get any noticeable results from using minisearch? Also, if you want, we can coordinate and work on this together.

Hetarth02 avatar Feb 17 '23 03:02 Hetarth02

Minisearch returns a lot more results than our current setup, but I think that's due to the combineWith setting. Changing it to "AND" reduces the number of results, but there's no way to do that on a per-field basis. Minisearch did improve the query originally mentioned in this issue, i.e. searching for adapter-static leads to the static site generation page. I haven't seen any glaring problems yet, but I still have to test other common queries.
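To illustrate why "OR" yields more results than "AND", here is a plain JavaScript sketch of the two combination modes. It does not call Minisearch; the per-term match lists and the `combineOr` and `combineAnd` helpers are invented for the example.

```javascript
// Hypothetical per-term matches: term -> ids of documents containing it.
const matchesByTerm = {
  adapter: ['adapter-static', 'adapter-node', 'configuration'],
  static: ['adapter-static', 'assets']
};

// OR: a document qualifies if it matches any term (union).
function combineOr(termMatches) {
  return [...new Set(termMatches.flat())];
}

// AND: a document qualifies only if it matches every term (intersection).
function combineAnd(termMatches) {
  if (termMatches.length === 0) return [];
  const [first, ...others] = termMatches;
  return first.filter((id) => others.every((ids) => ids.includes(id)));
}

const lists = Object.values(matchesByTerm);
console.log(combineOr(lists));  // every document matching either term
console.log(combineAnd(lists)); // only documents matching both terms
```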

gtm-nayan avatar Feb 17 '23 04:02 gtm-nayan

There's an up-and-coming in-memory search engine, fully built from the ground up to be performant for full-text search, called Lyra. The project seems quite intuitive and the people behind it are constantly improving it. It might be worth giving it a shot for the docs 🤔

boian-ivanov avatar Mar 01 '23 09:03 boian-ivanov

Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs

run node index.mjs "query goes here"

would be great if folks could help with the evaluation, i.e. compare the results it gives for something you searched recently against the current setup on kit.svelte.dev and share the findings here

gtm-nayan avatar Jun 07 '23 12:06 gtm-nayan

I just tried it out:

❯ node index.mjs "ssr"
ssr
[
  '/docs/single-page-apps#prerendering-individual-pages',
  '/docs/types#public-types-server',
  '/docs/page-options#prerender-prerender-and-ssr',
  '/docs/routing#layout-layout-server-js',
  '/docs/page-options#csr',
  '/docs/routing#page-page-svelte',
  '/docs/types#public-types-ssrmanifest',
  '/docs/state-management#using-stores-with-context',
  '/docs/routing#layout-layout-js',
  '/docs/load#universal-vs-server-when-does-which-load-function-run'
]

In the current docs the first result is /docs/page-options#ssr which doesn't seem to be included here in the search results.

karimfromjordan avatar Jun 07 '23 17:06 karimfromjordan