esphome-docs

Add embedding search

Open KTibow opened this issue 3 months ago • 12 comments

Description:

  • Shows 3 recommended pages when you start typing in the search box

  • Fast (also doesn't require hitting "go") and better at finding some pages (light -> light component is the first result, internet -> shows internet-related components), although it doesn't replace the existing search

  • Uses GloVe-based embeddings shipped directly to the browser, with no third party involved. The current search index is ~250 kB compressed; this index is ~300 kB compressed, including a number of words that might be typed.

    • Too big? Let me know and I'll try to ship less
    • Should it only load data once the search bar is used? Let me know and I'll implement that

Merge first: https://github.com/esphome/esphome-docs/pull/3773
Related issue (if applicable): N/A
Pull request in esphome with YAML changes (if applicable): N/A

Checklist:

  • N/A I am merging into next because this is new documentation that has a matching pull-request in esphome as linked above.
    or
  • [x] I am merging into current because this is a fix, change and/or adjustment in the current documentation and is not for a new component or feature.
  • N/A Link added in /index.rst when creating new documents for new components or cookbook.

KTibow avatar Apr 21 '24 05:04 KTibow

Deploy Preview for esphome ready!

Latest commit: 07dd59f8bcde884833875919a86d2470ff7142af
Latest deploy log: https://app.netlify.com/sites/esphome/deploys/663af3c1cd036b00083000a4
Deploy Preview: https://deploy-preview-3774--esphome.netlify.app

netlify[bot] avatar Apr 21 '24 05:04 netlify[bot]

A few quick thoughts:

  • remove changelogs from the search index
  • some kind of mapping between I²C and I2C, as people won't type superscripts into the search box but would still like to get the results
  • maybe prioritize title results to appear first, and text body result appear lower
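
One possible way to get the I²C / I2C mapping suggested above is Unicode compatibility normalization applied to both the indexed titles and the typed query. The sketch below is only an illustration in Python, not code from this PR; the `normalize_query` helper name is hypothetical.

```python
import unicodedata


def normalize_query(text: str) -> str:
    """Fold compatibility characters and letter case before matching."""
    # NFKC replaces compatibility characters such as U+00B2 (superscript two)
    # with their plain equivalents, so "I²C" and "I2C" normalize to the same text.
    return unicodedata.normalize("NFKC", text).casefold().strip()


# Both spellings end up as the same search key.
print(normalize_query("I²C Bus") == normalize_query("i2c bus"))
```

Applying the same function at index time and at query time keeps the two sides comparable without maintaining a hand-written list of special cases.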

nagyrobi avatar Apr 21 '24 06:04 nagyrobi

I had some of the same thoughts, but found out I couldn't implement them.

Right now, I'm only searching on the titles. This isn't ideal, but I'm using a simple weighted average to turn word embeddings into phrase embeddings, so including the body might drown out the title. Plus, the fact that we're using embeddings means that related terms will still work.

It turns out GloVe doesn't have a word embedding for i2c. I should see if the original tokenizer was tokenizing it differently.
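
A minimal sketch of the approach described above, assuming toy data: each title is embedded as a weighted average of its word vectors, words without an embedding are skipped, and the top 3 titles are returned by cosine similarity. The vectors, weights, titles, and function names are all made up for illustration; the PR's actual weights and index format are not shown in this thread.

```python
import math

# Toy 3-dimensional vectors standing in for real GloVe embeddings (values made up).
WORD_VECTORS = {
    "light": [0.9, 0.1, 0.0],
    "lamp": [0.8, 0.2, 0.1],
    "component": [0.3, 0.3, 0.3],
    "wifi": [0.0, 0.9, 0.2],
    "internet": [0.1, 0.8, 0.3],
    "sensor": [0.2, 0.1, 0.9],
}
# Hypothetical per-word weights: generic words count less in the average.
WORD_WEIGHTS = {"component": 0.2}
# Hypothetical page titles.
TITLES = ["Light Component", "WiFi Component", "Sensor Component"]


def embed_phrase(phrase):
    """Weighted average of the known word vectors, or None if no word is known."""
    total = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for word in phrase.lower().split():
        vector = WORD_VECTORS.get(word)
        if vector is None:
            continue  # out-of-vocabulary word, e.g. "i2c" in this toy table
        weight = WORD_WEIGHTS.get(word, 1.0)
        total = [t + weight * v for t, v in zip(total, vector)]
        weight_sum += weight
    if weight_sum == 0.0:
        return None
    return [t / weight_sum for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, titles, limit=3):
    """Return up to `limit` titles, most similar to the query first."""
    query_vector = embed_phrase(query)
    if query_vector is None:
        return []  # nothing in the query has an embedding
    scored = []
    for title in titles:
        title_vector = embed_phrase(title)
        if title_vector is not None:
            scored.append((cosine(query_vector, title_vector), title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:limit]]


print(search("internet", TITLES))
print(search("i2c", TITLES))
```

In this toy setup a query made up only of unknown words returns no results, which mirrors the i2c problem: without a vector for the token, there is nothing to compare against.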

KTibow avatar Apr 21 '24 13:04 KTibow

Try "graph". Changelogs in the results are really annoying. Graph Component by itself doesn't appear in the list.

Maybe it would be worth taking ref instances into account separately.

nagyrobi avatar Apr 21 '24 15:04 nagyrobi

Because the graph component isn't its own page, it doesn't show up as a separate result in either this search or the default Sphinx search. (It would be nice if subheaders like "light effects", "graph component", and "pin" showed up, but that would probably require hardcoding or something.) You are using the embedding search (the one that shows while you're typing) though, right? I can't get changelogs to show up by searching for "graph" with it.

Maybe it would be worth taking ref instances into account separately.

That sounds interesting, but I'm not sure what that would look like in practice. Could you elaborate?

KTibow avatar Apr 21 '24 15:04 KTibow

I just type in what I search for, and if there's nothing relevant showing up while typing, I press the button.

I would expect to have a more detailed set of relevant results, without changelog entries, etc., based on similar criteria as the popups I got while typing.

In my mind the two are one.

nagyrobi avatar Apr 21 '24 16:04 nagyrobi

I understand where you're coming from: you would expect them to act similarly. However, this PR is just an incremental change that only adds a search while typing; the more detailed search is exactly the same as the one currently on https://esphome.io/. I could hook into Sphinx's search logic and add some custom filtering and sorting, but I would prefer to just edit the search while typing.

KTibow avatar Apr 21 '24 16:04 KTibow

I understand that. But users will see it holistically, so it would really help to have a consistent experience.

nagyrobi avatar Apr 21 '24 17:04 nagyrobi

I think anything that improves searching esphome would be beneficial. With this PR, the search doesn’t appear to produce the results I’d expect. With a search that is based on the title, I’d expect “abp” (either lowercase or caps) to show the Honeywell ABP and ABP2 components, yet it shows neither one. “Honeywell” finds both. “Pressure” finds neither and only lists one component with the word “pressure” in the title.

RubyBailey avatar Apr 22 '24 01:04 RubyBailey

Looks like if I increase the embedding dimension to 50 the performance gets better. I'll have to see how I can increase the dimensions while not shipping too many word embeddings.

KTibow avatar Apr 22 '24 01:04 KTibow

It’s somewhat better, yet “Pressure” still doesn’t show 3 components whose titles include the word pressure.

RubyBailey avatar Apr 22 '24 16:04 RubyBailey

@RubyBailey For some reason one component is using "co_" (which we have no embeddings for) instead of "co", making it embed badly. Optimally it would just use "co", but I just turned the "co_" token into the "co" token, which fixes the problem.
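
The token fix described above could look roughly like the following sketch, which rewrites known problem tokens before the embedding lookup. The alias table, vocabulary, example title, and function names are hypothetical; only the "co_" to "co" mapping comes from the comment.

```python
# Alias table: tokens with no embedding mapped to a token that has one.
TOKEN_ALIASES = {"co_": "co"}

# Hypothetical vocabulary of tokens that do have an embedding.
KNOWN_TOKENS = {"co", "sensor"}


def canonical_tokens(title):
    """Lowercase and split a title, replacing aliased tokens."""
    return [TOKEN_ALIASES.get(raw, raw) for raw in title.lower().split()]


def missing_tokens(title):
    """Tokens that would still have no embedding after aliasing."""
    return [token for token in canonical_tokens(title) if token not in KNOWN_TOKENS]


print(canonical_tokens("CO_ Sensor"))
print(missing_tokens("CO_ Sensor"))
```

A helper like `missing_tokens` could also be run over every title at build time to spot other tokens that embed badly for the same reason.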

KTibow avatar Apr 22 '24 23:04 KTibow