Auto-search content of pages

Open damithc opened this issue 6 years ago • 13 comments

Current: only page titles and specified keywords in the frontmatter appear in search results.

Suggested: also include other content in pages for search results

damithc avatar Apr 25 '18 11:04 damithc

Raising priority as full-text search can greatly enhance the usefulness of a content-heavy website.

I don't mind if full-text search is a separate page altogether and takes some time to load (i.e., if the full search index needs to be downloaded to the browser first).

damithc avatar Dec 14 '18 11:12 damithc

Are we open to integrating existing solutions for full-text search?

Docsearch (free, open-source):

DocSearch will crawl your documentation website, push its content to an Algolia index, and allow you to add a dropdown search menu for your users to find relevant content in no time.

amad-person avatar Jan 29 '19 09:01 amad-person

Are we open to integrating existing solutions for full-text search?

Ideally, we should have a decent built-in solution and the ability to integrate other third-party solutions.

damithc avatar Jan 29 '19 09:01 damithc

As discussed with @marvinchin today, Marvin is planning to explore using the Lunr.js library to implement a built-in full-text search. This library is also used by MkDocs.

yamgent avatar Jan 30 '19 03:01 yamgent

Are we still looking to have built-in full text search for V2? 😅 I'm not sure that I can finish it by the end of the semester.

marvinchin avatar Mar 13 '19 10:03 marvinchin

Are we still looking to have built-in full text search for V2? 😅 I'm not sure that I can finish it by the end of the semester.

Good to have, but not necessary. Same for the FOUC problem. Both have a good-enough workaround but not a full-fledged solution.

damithc avatar Mar 13 '19 10:03 damithc

I've just published an almost year long project originally motivated by this issue:

  • https://github.com/ang-zeyu/morsels
  • https://ang-zeyu.github.io/morsels/

It consists of a CLI file indexer (which can be integrated by copying the binary, similar to what we do for plantuml.jar), a search library powered by WASM (Rust), and a search UI (TypeScript).

It deals with the issue in 2 aspects:

  • Scalability

    I don't mind if full-text search is a separate page altogether and takes some time to load (i.e., if the full search index needs to be downloaded to the browser first).

    I found this issue to be common to many static site generators using Lunr.js or some other client-side search solution. This was my primary motivation in creating this project (see https://github.com/olivernn/lunr.js/issues/222 and the discussion at https://github.com/rust-lang/mdBook/issues/51 for example), although it turned out to be a secondary plus in the end.

    The primary difference in approach is fragmenting the index into many separate files; at search time, only the files needed for the query are retrieved.

    For the same reason, the indexer is written in Rust (⭐ it indexes the entire 2103 site in 0.5s!), as is the search library (Rust compiled to WASM). Alternative JS-based implementations of both were also trialed and tuned; the performance differences are significant.

    This does mean a relatively larger binary / bundle size (334KB gzipped WASM file), which I'm still working to improve. The silver lining is that search hopefully isn't something users activate within the first 1-2s of page load.

  • A complete e2e search solution

    Due to minor implications of scalability in the internal design, I also ended up creating an entire search user interface library. To my knowledge there aren't many "complete" (indexer -> search library -> ui) solutions around (barring algolia docsearch which is an entirely different beast).

I haven't really marketed it as I'm still tying up some things (e2e tests, getting Windows Defender to stop flagging the executables as viruses, some more bugs), but I could look into integrating it here sometime 😃.
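
To illustrate the index fragmentation idea described above, here is a minimal sketch in Python. It is not Morsels' actual implementation: the shard count, the hashing scheme, and the names `build_shards`, `shard_of`, and `fetch_shard` are all assumptions made for illustration. Each term is assigned to one shard, and a query loads only the shards that hold its terms.

```python
import re
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; a real indexer would size this from the corpus


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shard_of(term, num_shards=NUM_SHARDS):
    """Map a term to a shard id. crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def build_shards(pages, num_shards=NUM_SHARDS):
    """Build one small inverted index (term -> sorted URLs) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for url, text in pages.items():
        for term in tokenize(text):
            shards[shard_of(term, num_shards)][term].add(url)
    return [{term: sorted(urls) for term, urls in shard.items()} for shard in shards]


def search(query, fetch_shard, num_shards=NUM_SHARDS):
    """Return URLs containing every query term, fetching only the shards needed.

    `fetch_shard(shard_id)` stands in for a network request for one index file.
    """
    terms = tokenize(query)
    if not terms:
        return []
    cache = {}
    result = None
    for term in terms:
        shard_id = shard_of(term, num_shards)
        if shard_id not in cache:
            cache[shard_id] = fetch_shard(shard_id)
        postings = set(cache[shard_id].get(term, []))
        result = postings if result is None else result & postings
    return sorted(result)
```

In a deployed site, each shard would be a separate static file and `fetch_shard` an HTTP request, so a query costs at most one request per distinct term instead of a download of the whole index.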

ang-zeyu avatar Dec 10 '21 04:12 ang-zeyu

I've just published an almost year long project originally motivated by this issue:

  • https://github.com/ang-zeyu/morsels
  • https://ang-zeyu.github.io/morsels/

Nice work, @ang-zeyu. Let's aim to integrate it into MarkBind in due course.

damithc avatar Dec 10 '21 05:12 damithc

I'm increasing the priority because Algolia DocSearch is undergoing a major revamp and they haven't been able to provide the search support for our module websites this semester so far. The sooner we reduce reliance on third-party search the better.

damithc avatar Jan 15 '22 13:01 damithc

If anyone would like to take up this issue, please feel free, I think this would be a rather fun thing to do. The library I mentioned above is more or less ready for use. I am currently just doing a fun infinite loop of "making it better and more marketable" but not actually doing any marketing 🤔😅

In the course of doing this, I also came across several related alternatives you can consider. All of them follow a CLI + WASM frontend architecture:

  • Stork
  • TinySearch
  • Pagefind - very recent, and the closest cousin of the library above. It implements the same idea of sharding index files. Currently, the main reason you might not want to use it is the extra network request latency, since it has no option to leave the index unfragmented. My library, by default, does not shard index files but offers the option to; this caters to the majority of use cases that do not need sharding (including the 2103 site, which generates just a ~3MB index) and greatly improves search latency.

Please don't let my selling here stop you from exercising your own judgement. Feel free to come to your own reasoning and choice, and post back here. I would love to hear your thoughts.

ang-zeyu avatar Jan 15 '23 06:01 ang-zeyu

Some non-exhaustive guidelines for implementation:

  • Consider how to map our current keywords feature to the new solutions
  • We currently don't delete output files when their sources are removed during live preview. For any file-based indexing solution you choose, deleting them will likely be necessary so that the index accurately reflects the current "state" of the content.
  • Obviously, the UI component needs to be adapted
  • Old header indexing code should be removed
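
The point about deleted files can be sketched as a reconciliation step that runs after each rebuild. This is a hypothetical Python illustration, not MarkBind's live preview code: `indexed` stands in for whatever record the chosen indexer keeps of the pages it has seen, and `built_pages` for the pages produced by the latest build.

```python
import hashlib


def digest(text):
    """Content fingerprint used to detect changed pages."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reconcile(indexed, built_pages):
    """Bring `indexed` (path -> digest) in line with `built_pages` (path -> content).

    Returns (stale, changed): paths to drop from the search index, and paths
    whose content is new or modified and needs re-indexing.
    """
    # Pages that were indexed earlier but no longer exist must be removed,
    # otherwise search keeps returning results for deleted pages.
    stale = sorted(set(indexed) - set(built_pages))
    for path in stale:
        del indexed[path]

    changed = []
    for path in sorted(built_pages):
        fingerprint = digest(built_pages[path])
        if indexed.get(path) != fingerprint:
            indexed[path] = fingerprint
            changed.append(path)
    return stale, changed
```

Without the `stale` step, a page deleted during live preview would linger in the index until the next full rebuild.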

ang-zeyu avatar Jan 15 '23 06:01 ang-zeyu

Hello, I've been looking at this issue, and one problem I've encountered is that content in components that are hidden from the user during the initial render (e.g. Panels) is not included in the search results. This is because libraries like Pagefind index the content only after the HTML files have been built. This rendering problem is also faced by other plugins like dataTable (@Tim-Siu) and Mermaid (@yiwen101 @LamJiuFong).

This behaviour is similar to that of the Algolia DocSearch we use now, where algolia-no-index is automatically added to content hidden by MarkBind's Vue components, so content hidden in panels similarly does not show up in search results.

With this in mind, I'd like to confirm whether the full-text search we want to implement should include content inside panels in its results, or whether it is OK for that content not to show up.

jingting1412 avatar Mar 20 '24 09:03 jingting1412

This behaviour is similar to that of the Algolia DocSearch we use now, where algolia-no-index is automatically added to content hidden by MarkBind's Vue components, so content hidden in panels similarly does not show up in search results.

With this in mind, I'd like to confirm whether the full-text search we want to implement should include content inside panels in its results, or whether it is OK for that content not to show up.

@jingting1412 I think it is fine (even necessary) to omit content from collapsed panels. But we can index content from expanded-by-default panels, right?
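
One way to picture this rule (skip collapsed panels, keep expanded ones) is a text extractor that drops any subtree marked as hidden. The sketch below is an assumption-laden illustration in Python: it assumes hidden content carries the algolia-no-index class mentioned above, that expanded panels do not, and that the built HTML is well-formed. The `card` class in the usage example is made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collect text, skipping every subtree whose root has `skip_class`."""

    def __init__(self, skip_class="algolia-no-index"):
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or self.skip_class in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    """Return the text of `html` that should be sent to the search index."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<div><p>Intro text</p>'
    '<div class="card algolia-no-index"><p>Collapsed panel body</p></div>'
    '<div class="card"><p>Expanded panel body</p></div></div>'
)
print(indexable_text(page))  # prints "Intro text Expanded panel body"
```

With this approach, whether a panel is indexed depends only on its state in the built HTML, which matches indexing expanded-by-default panels while omitting collapsed ones.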

damithc avatar Mar 20 '24 15:03 damithc