Crawler: lvl0 is missing

Open dg opened this issue 2 years ago • 6 comments

Description

I found out that the crawler doesn't create a record with type: lvl0. The first record (as seen in the picture) is for the paragraph that follows the title, and the record for the lvl0 itself is missing. Yet the title is visible in the hierarchy key, i.e. "lvl0": "Composer Usage Tips":

(crawled page is https://doc.nette.org/en/best-practices/composer)

image

Why is this a problem?

Because if there is no lvl0 record, the default setting of searchableAttributes prevents searching by title (hierarchy_radio_camel.lvl0 and hierarchy_radio.lvl0 are always null):

image
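
A simplified model may help illustrate why the radio attributes stay empty. It assumes that a heading record keeps only its deepest non-null level in its radio hierarchy and that content records keep none, which is how the legacy DocSearch scraper appears to behave. `toRadio`, the "Example Section" heading, and the sample records are hypothetical and are not the crawler's actual code.

```javascript
// Hypothetical model, NOT the crawler's real implementation.
const LEVELS = ["lvl0", "lvl1", "lvl2"];

// Assumption: a heading record keeps only its deepest non-null level;
// a content record keeps no radio value at all.
function toRadio(record) {
  const radio = Object.fromEntries(LEVELS.map((level) => [level, null]));
  if (record.type === "content") return radio;
  for (const level of [...LEVELS].reverse()) {
    if (record.hierarchy[level] !== null) {
      radio[level] = record.hierarchy[level];
      break;
    }
  }
  return radio;
}

// Records as described above: the page has no record of type lvl0.
const records = [
  {
    type: "content",
    hierarchy: { lvl0: "Composer Usage Tips", lvl1: null, lvl2: null },
  },
  {
    type: "lvl1",
    hierarchy: { lvl0: "Composer Usage Tips", lvl1: "Example Section", lvl2: null },
  },
];

// The radio lvl0 value of every record; none of them carries the title.
const radioLvl0 = records.map((record) => toRadio(record).lvl0);
console.log(radioLvl0);
```

Under this model the title would only reach the radio lvl0 attribute through a record whose own type is lvl0, which is exactly the record that is missing.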

Without 'radio'

If I replace the radio attributes with their non-radio counterparts, searching by the top title works, but the same page is returned as many times as it has lower-level titles, and no other pages are returned:

image
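
The duplication can be reproduced with a toy matcher. This is only a sketch: `matchesTitle`, the URLs, and the section names are hypothetical, and real Algolia matching and ranking are far more involved. It assumes that every record of a page carries the page title in hierarchy.lvl0, as in the record described at the top of this issue.

```javascript
// Hypothetical records for a single page; each one repeats the page title.
const pageRecords = [
  {
    type: "content",
    url: "/composer",
    hierarchy: { lvl0: "Composer Usage Tips", lvl1: null },
  },
  {
    type: "lvl1",
    url: "/composer#section-a",
    hierarchy: { lvl0: "Composer Usage Tips", lvl1: "Section A" },
  },
  {
    type: "lvl1",
    url: "/composer#section-b",
    hierarchy: { lvl0: "Composer Usage Tips", lvl1: "Section B" },
  },
];

// Toy matcher: case-insensitive substring match on the non-radio lvl0 attribute.
function matchesTitle(record, query) {
  const title = record.hierarchy.lvl0 ?? "";
  return title.toLowerCase().includes(query.toLowerCase());
}

// Every record of the page matches, so the page shows up once per record.
const hits = pageRecords.filter((record) => matchesTitle(record, "composer usage"));
console.log(hits.map((hit) => hit.url));
```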

dg avatar Jan 09 '22 22:01 dg

I think we have the same problem on https://docs.avax.network. For example, searching for "command line interface" (https://docs.avax.network/search?q=command%20line%20interface) does not return the article https://docs.avax.network/build/references/command-line-interface, even though "command line interface" is its first section header.

yulin-dong avatar Feb 17 '22 00:02 yulin-dong

I think we had the same issue at https://docs.gitlab.com, and I ended up duplicating the selectors of lvl0 and lvl1:

            lvl0: {
              selectors: ".article-content h1",
              defaultValue: "Documentation",
            },
            lvl1: ".article-content h1",
            lvl2: ".article-content h2",
            lvl3: ".article-content h3",
            lvl4: ".article-content h4",
            lvl5: ".article-content h5, .article-content td:first-child",
            content:
              ".article-content p, .article-content li, .article-content td:last-child, .article-content pre.highlight code",

axilleas avatar Feb 17 '22 12:02 axilleas
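
For context, a fragment like the one in the comment above normally sits inside the recordProps argument of helpers.docsearch in a crawler action. The sketch below shows that placement under the assumption that the config follows the usual DocSearch crawler template layout. Only the selectors come from the comment; the list is shortened for brevity.

```javascript
// Sketch of a crawler action's recordExtractor (layout assumed from the
// standard DocSearch crawler templates; selectors taken from the comment above).
const recordExtractor = ({ helpers }) =>
  helpers.docsearch({
    recordProps: {
      // lvl0 and lvl1 share the h1 selector, so the title is also
      // extracted at a level for which heading records are created.
      lvl0: {
        selectors: ".article-content h1",
        defaultValue: "Documentation",
      },
      lvl1: ".article-content h1",
      lvl2: ".article-content h2",
      lvl3: ".article-content h3",
      content: ".article-content p, .article-content li",
    },
  });
```

The trade-off is that the title appears twice in each record's hierarchy, once as lvl0 and once as lvl1.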

Hey there

> I found out that crawler doesn't create a record with a type: lvl0

@dg We decided not to create records for lvl0 alone, because they produce a lot of duplicate results in DocSearch v3 (although this makes the search less optimal in DocSearch v2). You can run some tests with the Search Preview in your crawler (editor, right side of the screen) to see the search results with DocSearch v3, which are much better than with v2.

We will of course investigate lvl0 records further to see whether we should revisit our decision.

> I think we have the same problem on https://docs.avax.network/. For example, searching command line interface

@yulin-dong By default, Docusaurus v2 puts the h1 inside a header element (header h1), and that is the selector we use to extract the h1 (lvl0) of your website. Your DOM, however, has no header element. You can update your lvl0 selector to article h1, and that should solve your issue.
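
A minimal sketch of that selector change, assuming the usual recordProps layout of a DocSearch crawler config. Only the article h1 selector for lvl0 comes from the reply above; the remaining selectors and the default value are illustrative placeholders to adapt to the actual site.

```javascript
// Sketch only: lvl0 now targets "article h1" instead of "header h1".
const recordProps = {
  lvl0: {
    selectors: "article h1", // was "header h1", which matches nothing without a <header>
    defaultValue: "Documentation", // placeholder fallback title
  },
  // Placeholder selectors for the lower levels and the content.
  lvl1: "article h2",
  lvl2: "article h3",
  content: "article p, article li",
};
```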

> I think we had the same issue at https://docs.gitlab.com/

Do you mean your results are less relevant without this duplicate selector?

shortcuts avatar Feb 17 '22 12:02 shortcuts

> Do you mean your results are less relevant without this duplicate selector?

@shortcuts I encountered this when we migrated to v3. H1s weren't showing in the results: https://gitlab.com/gitlab-org/gitlab-docs/-/merge_requests/2321#note_831782470.

axilleas avatar Feb 17 '22 13:02 axilleas

Hm I see, will make sure to increase the priority here.

shortcuts avatar Feb 17 '22 13:02 shortcuts

If the need is there, it could be a simple option, disabled by default 👍🏻

bodinsamuel avatar Feb 17 '22 16:02 bodinsamuel

For a year I was unable to set up meaningful indexing, until today…

It wasn't until today that I came across the absolutely most important sentence: image

I couldn't have figured it out sooner because absolutely ALL the examples say the opposite:

image

So my main headers kept disappearing and this issue arose.

Please fix the examples and put this sentence in the Record Extractor documentation and also in Tips for a good search. It'll save the kitty :-) Thanks.

@shortcuts

dg avatar Feb 07 '23 17:02 dg