docsearch
Crawler: lvl0 is missing
Description
I found out that the crawler doesn't create a record with type: lvl0. The first record (as seen in the picture) is the paragraph that follows the title, and the record with the lvl0 is missing. Yet the title is visible in the hierarchy key, i.e. "lvl0": "Composer Usage Tips":
(crawled page is https://doc.nette.org/en/best-practices/composer)
Why is this a problem?
Because if there is no lvl0 record, the default setting of searchableAttributes prevents searching by title (hierarchy_radio_camel.lvl0 and hierarchy_radio.lvl0 are always null):
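To make the failure mode concrete, here is a minimal sketch of how the radio attributes end up null. It assumes a simplified model in which a record's hierarchy_radio keeps only the heading level that matches the record's own type. The names toRadio, buildRecords and the emitLvl0 flag are hypothetical and not part of the crawler API, and the sample page text other than the title is made up.

```javascript
// Simplified, hypothetical model: not the crawler's actual implementation.
const LEVELS = ["lvl0", "lvl1", "lvl2"];

// Assumption: a "radio" hierarchy keeps only the level equal to the record type.
function toRadio(type, hierarchy) {
  const radio = {};
  for (const level of LEVELS) {
    radio[level] = level === type ? hierarchy[level] : null;
  }
  return radio;
}

// Walk the page nodes in document order and emit one record per node.
// `emitLvl0` (hypothetical flag) toggles whether the h1 gets its own record.
function buildRecords(nodes, { emitLvl0 }) {
  const hierarchy = { lvl0: null, lvl1: null, lvl2: null };
  const records = [];
  for (const node of nodes) {
    const depth = LEVELS.indexOf(node.type);
    if (depth !== -1) {
      // A new heading resets every deeper level of the hierarchy.
      for (const level of LEVELS.slice(depth + 1)) hierarchy[level] = null;
      hierarchy[node.type] = node.text;
      // Skipping the lvl0 record models the behaviour reported in this issue.
      if (node.type === "lvl0" && !emitLvl0) continue;
    }
    records.push({
      type: node.type,
      hierarchy: { ...hierarchy },
      hierarchy_radio: toRadio(node.type, hierarchy),
      content: node.type === "content" ? node.text : null,
    });
  }
  return records;
}

// Sample page: only the lvl0 title comes from the issue, the rest is made up.
const page = [
  { type: "lvl0", text: "Composer Usage Tips" },
  { type: "content", text: "First paragraph (made-up sample text)." },
  { type: "lvl1", text: "Sample section" },
];

const hasRadioLvl0 = (records) =>
  records.some((record) => record.hierarchy_radio.lvl0 !== null);

const withoutLvl0 = buildRecords(page, { emitLvl0: false });
const withLvl0 = buildRecords(page, { emitLvl0: true });

console.log(hasRadioLvl0(withoutLvl0)); // false: the title lives only in `hierarchy`
console.log(hasRadioLvl0(withLvl0)); // true
```

In this model the title still appears in hierarchy.lvl0 of every record, which matches what the issue describes, but nothing ever populates hierarchy_radio.lvl0 unless a lvl0 record is emitted.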
Without 'radio'
If I replace radio in the attributes with non-radio attributes, searching by the top title works, but the same page is returned as many times as there are other lower titles in it, and no more pages are returned:
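The swap described above could look roughly like the following sketch. Only the two radio attribute names are taken from this issue; the unordered() wrapper, the "content" entry and the non-radio names hierarchy_camel.lvl0 and hierarchy.lvl0 are assumptions about the index settings, so check them against your own configuration.

```javascript
// Only the two radio lvl0 names come from the issue; the unordered()
// wrapper and the "content" entry are assumptions about the settings.
const radioAttributes = [
  "unordered(hierarchy_radio_camel.lvl0)",
  "unordered(hierarchy_radio.lvl0)",
  "content",
];

// Replace each radio variant with its (assumed) non-radio counterpart.
const nonRadioAttributes = radioAttributes.map((attribute) =>
  attribute.replace("hierarchy_radio", "hierarchy")
);

console.log(nonRadioAttributes);
```

Because every record of a page carries the full non-radio hierarchy, each of them can match the title, which is consistent with the duplicated results described above.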
I think we have the same problem on https://docs.avax.network. For example, searching command line interface (https://docs.avax.network/search?q=command%20line%20interface) does not return the article https://docs.avax.network/build/references/command-line-interface as part of the results, even though command line interface is its first section header.
I think we had the same issue at https://docs.gitlab.com, and I ended up using the same selector for both lvl0 and lvl1:
lvl0: {
  selectors: ".article-content h1",
  defaultValue: "Documentation",
},
lvl1: ".article-content h1",
lvl2: ".article-content h2",
lvl3: ".article-content h3",
lvl4: ".article-content h4",
lvl5: ".article-content h5, .article-content td:first-child",
content:
  ".article-content p, .article-content li, .article-content td:last-child, .article-content pre.highlight code",
Hey there
> I found out that crawler doesn't create a record with a type: lvl0
@dg We decided not to create records for the lvl0 only, as they create a lot of duplicate results for DocSearch v3 (although omitting them makes the search less optimal in DocSearch v2). You can run some tests using the Search Preview in your crawler -> editor -> right side of the screen and see the search results with DocSearch v3, which are much better than v2.
We will of course investigate lvl0 records further to see if we should revisit our decision.
> I think we have the same problem on https://docs.avax.network/. For example, searching command line interface
@yulin-dong By default, Docusaurus v2 puts the h1 inside a header element (header h1), which is the selector we use to extract the h1 (lvl0) of your website, while in your DOM there is no header element. You can update your lvl0 selector to article h1 and it will solve your issue.
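A minimal sketch of that change, assuming the existing configuration reads lvl0 from header h1 as described above. The helper withLvl0Selector and the lvl1 and content selectors in the sample object are hypothetical placeholders, not the actual docs.avax.network configuration.

```javascript
// Hypothetical helper: return a copy of recordProps with a new lvl0 selector,
// keeping any other lvl0 options (such as defaultValue) intact.
function withLvl0Selector(recordProps, selector) {
  const current = recordProps.lvl0;
  const isOptionsObject =
    typeof current === "object" && current !== null && !Array.isArray(current);
  const lvl0 = isOptionsObject ? { ...current, selectors: selector } : selector;
  return { ...recordProps, lvl0 };
}

// Assumed starting point; only `header h1` and `article h1` come from the thread.
const before = {
  lvl0: "header h1",
  lvl1: "article h2",
  content: "article p, article li",
};
const after = withLvl0Selector(before, "article h1");

console.log(after.lvl0); // "article h1"
```

The original object is left untouched, so the two configurations can be compared side by side in the crawler editor.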
> I think we had the same issue at https://docs.gitlab.com/
Do you mean your results are less relevant without this duplicate selector?
> Do you mean your results are less relevant without this duplicate selector?
@shortcuts I encountered this when we migrated to v3. H1s weren't showing in the results: https://gitlab.com/gitlab-org/gitlab-docs/-/merge_requests/2321#note_831782470.
Hm I see, will make sure to increase the priority here.
if the need is there, it could be a simple option disabled by default 👍🏻
For a year I was unable to set up meaningful indexing, until today…
It wasn't until today that I came across the single most important sentence:
I couldn't have figured it out sooner because absolutely ALL the examples say the opposite:
So my main headers kept disappearing and this issue arose.
Please fix the examples and put this sentence in the Record Extractor documentation and also in Tips for a good search. It'll save the kitty :-) Thanks.
@shortcuts