jupiterbroadcasting.com

Search

Open gerbrent opened this issue 3 years ago • 23 comments
Figure out a good way to integrate search. A client-side approach like Lunr.js will probably not perform well due to index size.

@theZMC recommends:

Since golang seems to be a theme here (which it should be), maybe zinc would fit the bill?

gerbrent avatar Jul 05 '22 01:07 gerbrent

If zinc is used, then from my understanding it will need to run on the server side (at least on the server which holds all the references to the objects/records).

Also of note: it appears that zinc is still in beta (src):

Project Status: ZincSearch is in Pre GA (General Availability) and will be marked as production ready at v1.0.0 .

So, it's possible they'll have major breaking changes before the 1.0 release, which means we'll need to make sure we pin the version and read the changelog before upgrading to see if anything's going to break.

While it's not a self-hosted option (and will probably cost money because of how many episodes JB has), as a temporary solution, we could use algolia. I've used them before, and overall it was pretty easy. I've actually got a GH action for doing CI with hugo content as well: https://github.com/Climate-Refugee-Stories/crs-website/blob/c82f394a620b4631bb43de5ca4433d33a51bb292/.github/workflows/cd.yml#L91-L126 (figure we probably won't go with this, but just figured I'd mention it).

elreydetoda avatar Jul 18 '22 05:07 elreydetoda

Meta: BTW, @gerbrent you might want to add a "JB - action needed" tag to this issue since it's discussing cost of running an extra service on a server specifically for search, and if that's something they want to contemplate (because that'll be another service to maintain).

elreydetoda avatar Jul 22 '22 03:07 elreydetoda

Typesense might be a good option

reesericci avatar Aug 04 '22 19:08 reesericci

@realorangeone usually has some pretty strong opinions on search.

ironicbadger avatar Aug 15 '22 16:08 ironicbadger

I have opinions as well, from a functionality and end-user perspective.

The search results at notes.jupiterbroadcasting.com are not at all to my liking pretty much every single time I try to use it, which is generally to answer a question like "I remember we mentioned that in the last few months, let's see which episode that was from". This generally gives results sorted by "relevance", which never gets me what I want (and yet I keep trying.....)

I would far prefer chronologically sorted search results.

  • see https://github.com/selfhostedshow/show-notes/issues/16

Also, on a slow connection, the UX of the current search is poor: the present-results-as-pop-up-in-search-bar behaviour isn't obvious for quite some time, until the results load. It's annoying, and slow enough to make me wonder more than once "is this working?"

gerbrent avatar Aug 15 '22 20:08 gerbrent

I'm willing to start putting some serious development work into this. Some clarifying questions:

  • Do we need full text search?
  • Any plans for some sort of transcription process (automatic or manual) and if so is there a timeline for when that would be in place?

theZMC avatar Aug 15 '22 22:08 theZMC

See the search at notes.jupiterbroadcasting.com - that's the type of thing I think is needed. @ChrisLAS @noblepayne or @gerbrent feel free to jump in here.

ironicbadger avatar Aug 15 '22 23:08 ironicbadger

See here for why sorting search results by recency is neither supported nor desired by mkdocs:

https://github.com/selfhostedshow/show-notes/issues/16#issuecomment-1216728294

gerbrent avatar Aug 16 '22 14:08 gerbrent

I agree lunr probably isn't ideal, as the index will be huge. I've written client-side search with Hugo, and it's very simple, but the index may be large given the show history.

mkdocs's search is lunr-based. The issue with mkdocs is that its pages have no sense of date, rather than any technological limitation. Hugo, however, does have dates as a concept, so it could be done.

For search, I suspect we'd want something server-side to do it. For ease (of local dev and hosting), scraping the content into sqlite and using its fulltext search would probably be very simple, very powerful and scalable.
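
As a rough illustration of the sqlite idea, here is a minimal sketch using Python's built-in `sqlite3` module and the FTS5 extension. It assumes the local SQLite build has FTS5 compiled in (most do, but not all). The table name, column names and sample episodes are made up for the example, and ordering by date rather than relevance is just one option, chosen here because chronological results were requested earlier in the thread.

```python
import sqlite3

# Hypothetical episode records; in practice these would be scraped from the built site.
EPISODES = [
    ("Self-Hosted 1", "2022-05-06", "We compare reverse proxies and talk about backups."),
    ("Linux Unplugged 2", "2022-06-12", "A deep dive into backups with a new filesystem."),
    ("Coder Radio 3", "2022-07-01", "Thoughts on Rust, Go and developer tooling."),
]


def build_index(rows):
    """Create an in-memory FTS5 table; the date is stored but not full-text indexed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE episodes USING fts5(title, published UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO episodes (title, published, body) VALUES (?, ?, ?)", rows
    )
    return conn


def search(conn, query, newest_first=True):
    """Return (title, published) pairs matching the query."""
    # ISO dates sort correctly as text; "rank" is FTS5's built-in relevance order.
    order = "published DESC" if newest_first else "rank"
    sql = f"SELECT title, published FROM episodes WHERE episodes MATCH ? ORDER BY {order}"
    return conn.execute(sql, (query,)).fetchall()


conn = build_index(EPISODES)
results = search(conn, "backups")
print(results)
```

Passing `newest_first=False` falls back to relevance ordering, so both behaviours could be offered from the same index.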

Elasticsearch etc are definitely options, but they're very heavy for what we need. As are hosted tools like Algolia, but given the name of one of our shows, that's a less desirable option.

RealOrangeOne avatar Aug 16 '22 15:08 RealOrangeOne

Could we run some mock ups with elastic and get a sense of just how heavy? We have the infra to do it I'd wager.

ironicbadger avatar Aug 16 '22 15:08 ironicbadger

It's not just heavy in terms of resources. It also makes local development much more of a pain, not to mention it's more complex to set up and work with anyway. The container alone is ~550 MB compressed.

RealOrangeOne avatar Aug 17 '22 14:08 RealOrangeOne

It's not just heavy in terms of resources. It also makes local development much more of a pain, not to mention it's more complex to set up and work with anyway. The container alone is ~550 MB compressed.

Unfortunately when it comes to search, I think it's the classic pick two between fast, good, and inexpensive. Though I do agree that full fat elastic is a bit too heavy-handed for our needs.

theZMC avatar Aug 17 '22 15:08 theZMC

So I just found a tool that might be worth using if search is still something we are after. It's called Pagefind and it is a single binary that indexes the site after it is built. There is a video on the home page of their site showing how it works and a basic example.

It's also written in :tada: Rust :tada:

CGBassPlayer avatar Sep 08 '22 15:09 CGBassPlayer

That looks pretty awesome! It's nice that we could just bundle that in an artifact as well. It just comes with the new site build! 🥳

elreydetoda avatar Sep 08 '22 16:09 elreydetoda

this DOES look fascinating!

The demo at the top of the page at https://pagefind.app/ is fast - much faster than our current notes.jupiterbroadcasting.com for me on a low end internet connection and low-end hardware.

Pagefind can run a full-text search on a 10,000 page site with a total network payload under 300KB, including the Pagefind library itself. For most sites, this will be closer to 100KB.

that sounds like us ; )

🎯 Another lovely demo: https://xkcd.pagefind.app/

gerbrent avatar Sep 08 '22 18:09 gerbrent

My big question - can results be sorted by date/recency? I see Pagefind has the concept of "date"

gerbrent avatar Sep 08 '22 18:09 gerbrent

I don't see why not since it is content on the page. I wonder if we will need a piece of metadata for the date.

But I found this tool about 15 minutes before I commented (Just long enough to watch the video)

CGBassPlayer avatar Sep 08 '22 18:09 CGBassPlayer

That can be very handy for the JB Archive (a distinct hugo instance):

Pagefind can be configured to search across multiple sites, merging results and filters into a single response. Multisite search configuration happens entirely in the browser, by pointing one Pagefind instance at multiple search bundles.

The following examples reflect Pagefind running on a website at blog.example.com that wants to include pages from docs.example.com in the search results.

https://pagefind.app/docs/multisite/

Changing the weighting of individual indexes

When searching across multiple sites you may want to rank each index higher or lower than the others. This can be achieved by passing an indexWeight option for each index:

https://pagefind.app/docs/multisite/#changing-the-weighting-of-individual-indexes

gerbrent avatar Sep 08 '22 18:09 gerbrent

Hello all! Have you considered https://www.meilisearch.com/ ? It's also an open source project with a valid source of income (they have recently received a 15M funding round). It is very easy to deploy. I'd be more than happy to write a backend RSS watcher and some mockups for the front end. As for costs, times are crazy, but I think I can commit to covering a year of runtime and on-call support as value for value 🥰

FlakM avatar Oct 30 '22 05:10 FlakM

oh wow, very generous @FlakM !!!

I'll be curious to hear what others think of MeiliSearch - def worth considering!

gerbrent avatar Nov 01 '22 13:11 gerbrent

I've recently been reviewing alternatives to the more traditional ELK stack and hosted options for my employer. Meilisearch has come up on This Week in Rust, so I have also looked into it. Here are some reasons why I think it would be a good fit here:

  • It uses very solid backing technology, i.e. LMDB, which was designed as an embedded database for OpenLDAP by very smart people. If you prefer podcasts, here is a great episode about it.
  • It has a rest API so it could be used without any other backend services apart from the component that will keep data in sync (and maybe some nginx to add some rate limiting/tls etc)
  • It is dead simple to deploy and maintain - just a single container
  • It has front-end code already written, so including it is also very simple
  • It has all of those nice features like typo safety, synonyms etc
  • It is blazingly fast :rocket: :crab:
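
To make the REST API point concrete, here is a hedged sketch of what a search call could look like from Python using only the standard library. The host, index name (`episodes`) and the `published` field are assumptions for illustration, not details of any deployed instance. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumptions for illustration only: host, index name, key and field name are hypothetical.
MEILI_URL = "http://localhost:7700"
INDEX = "episodes"
API_KEY = "MASTER_KEY"


def build_search_request(query, limit=10):
    """Build (but do not send) a POST request for Meilisearch's search endpoint."""
    payload = {
        "q": query,
        "limit": limit,
        # Sorting only works if "published" is listed in the index's sortableAttributes.
        "sort": ["published:desc"],
    }
    return urllib.request.Request(
        url=f"{MEILI_URL}/indexes/{INDEX}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_search_request("zfs")
print(request.get_method(), request.full_url)

# To actually run the search against a live instance:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]
```

In a real deployment the browser would more likely call this endpoint directly with a search-only key, with nginx in front for rate limiting and TLS as mentioned above.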

For your convenience, I've deployed a sample service and loaded the index with the contents of all the RSS feeds. It's available here (BTW it is a proper use of Linode credits); the secret key is MASTER_KEY. Keep in mind that it is the result of a fast and dirty effort. For a full-blown index, I think it would be useful to also add transcription (I have experience with DeepSpeech, so probably not a big problem) and more complete show notes. The current showcase version of the code loading data from the RSS feed is available here.
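
For readers curious what loading an RSS feed into a search index might involve, the following is a minimal sketch and not the linked showcase code. It parses a made-up two-item feed with the standard library and turns each item into a flat document. The field names (`id`, `title`, `url`, `published`, `body`) are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A made-up two-item feed standing in for a real show feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <link>https://example.com/1</link>
      <pubDate>Mon, 04 Jul 2022 12:00:00 +0000</pubDate>
      <description>Notes about search engines.</description>
    </item>
    <item>
      <title>Episode 2</title>
      <link>https://example.com/2</link>
      <pubDate>Mon, 11 Jul 2022 12:00:00 +0000</pubDate>
      <description>Notes about backups.</description>
    </item>
  </channel>
</rss>"""


def feed_to_documents(feed_xml):
    """Convert RSS <item> elements into flat dictionaries ready for indexing."""
    root = ET.fromstring(feed_xml)
    documents = []
    for number, item in enumerate(root.iter("item"), start=1):
        # RSS dates use the RFC 822 format, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        documents.append({
            "id": number,
            "title": item.findtext("title"),
            "url": item.findtext("link"),
            # A numeric timestamp is easy to sort on in most search engines.
            "published": int(published.timestamp()),
            "body": item.findtext("description", default=""),
        })
    return documents


documents = feed_to_documents(SAMPLE_FEED)
for doc in documents:
    print(doc["id"], doc["title"], doc["published"])
```

Storing the publish date as a number keeps date-based sorting cheap, and transcripts could later be added as an extra field on the same documents.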

FlakM avatar Nov 02 '22 07:11 FlakM

amazing again @FlakM !! Will look at this further in a few days.. thank you!!

gerbrent avatar Nov 03 '22 17:11 gerbrent

@kylepotts suggests start of convo & end of convo:

What things have we tried for searching transcriptions? I wonder if taking the output of the transcription and putting it inside something like ElasticSearch/Opensearch and exposing it via an API is overkill? Or if a product like that already exists. Definitely will require a unique way to have a "dynamic" results page in Hugo from where you search.

elreydetoda avatar Feb 19 '23 13:02 elreydetoda