
Prioritise Specific URLs in the Configs JSON

Open · praneesha opened this issue 2 years ago · 15 comments

We have implemented DocSearch v3. We need to prioritize a URL pattern (e.g., https://lib.ballerina.io/ballerina/grpc/latest/**) for a particular search term (e.g., grpc), and we have edited the config in the new Crawler web interface as follows.

```js
{
  indexName: "ballerina",
  pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "",
          defaultValue: "Ballerina gRPC",
        },
        lvl1: ".content h1",
        lvl2: ".content h2",
        lvl3: ".content h3",
        lvl4: ".content h4",
        lvl5: ".content h5, .content h6",
        content: ".content p, .content li",
        site: {
          defaultValue: ["ballerina_api_docs_grpc"],
        },
        tags: {
          defaultValue: ["ballerina_api_docs_grpc"],
        },
        pageRank: "4",
      },
      indexHeadings: true,
    });
  },
},
```

Although we scheduled and ran a manual crawl, the search results haven't been updated. What are we missing here?


praneesha avatar Jan 03 '22 12:01 praneesha

Hey,

Is there another action with a broader pathsToMatch that also matches https://lib.ballerina.io/ballerina/grpc/latest/?

If so, the URL will be crawled by both actions, and the records from one might override the other. You can exclude the URL by adding !https://lib.ballerina.io/ballerina/grpc/latest/** to the pathsToMatch of that other action.
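As a sketch (assuming the broader action really does cover all of lib.ballerina.io; the variable name `broadAction` is just for illustration), the negative glob would sit alongside the existing pattern:

```javascript
// Hypothetical excerpt of the broader crawler action. The leading "!"
// marks a negative glob, so the gRPC "latest" pages are skipped here
// and handled only by the dedicated action.
const broadAction = {
  indexName: "ballerina",
  pathsToMatch: [
    "https://lib.ballerina.io**/**", // everything under lib.ballerina.io
    "!https://lib.ballerina.io/ballerina/grpc/latest/**", // except gRPC latest
  ],
  // recordExtractor omitted; keep the action's existing one
};
```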

shortcuts avatar Jan 03 '22 12:01 shortcuts

@shortcuts - Thanks a lot for the quick response.

Yes, we do have another action that matches this URL. I excluded it as follows and re-ran the crawler.

```js
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io**/**",
  "!https://lib.ballerina.io/ballerina/grpc/latest/**",
],
```

However, the search results are still not updated as expected. Is there anything else we need to do?

praneesha avatar Jan 03 '22 13:01 praneesha

Looking at your index, records are populated with the correct weight and pageRank; maybe you need to provide a higher value?

shortcuts avatar Jan 03 '22 13:01 shortcuts

@shortcuts - It seems we will have to exclude the old versions of a particular URL entirely to stop them from appearing in the search results?

For example,

  • https://lib.ballerina.io/ballerina/http/latest/** should be included
  • https://lib.ballerina.io/ballerina/http/2.0.1/** should be excluded
  • https://lib.ballerina.io/ballerina/http/2.0.0/** should be excluded

Do we have to exclude the old versions by adding an entry for each of them as follows?

```js
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io/ballerina/http/latest/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.0/**",
],
```

praneesha avatar Jan 04 '22 07:01 praneesha

Exactly, as long as a URL matches a pathsToMatch entry, records will be created! You can also define URLs you don't want to crawl at all as exclusionPatterns.

shortcuts avatar Jan 04 '22 08:01 shortcuts

@shortcuts - Now, we have updated the exclusionPatterns as follows.

```js
startUrls: [
  "https://ballerina.io/",
  "https://lib.ballerina.io/",
  "https://blog.ballerina.io/",
  "https://central.ballerina.io/",
],
renderJavaScript: false,
sitemaps: ["https://ballerina.io/sitemap.xml"],
exclusionPatterns: [
  "https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "https://lib.ballerina.io/ballerina/http/2.0.0/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.2/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.1/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-alpha8/**",
],
```

However, we still get these excluded URLs in the search results as shown below. Is there anything we have missed here?

[Screenshot 2022-01-04 at 16 17 26]

praneesha avatar Jan 04 '22 10:01 praneesha

You can test them directly in the URL tester (Crawler -> Editor -> right-side "URL tester" tab) to see whether they are excluded. If not, they are crawled.

What are the URLs?

shortcuts avatar Jan 04 '22 10:01 shortcuts

@shortcuts - I have deprioritized the pattern https://ballerina.io**/** relative to https://lib.ballerina.io/ballerina/grpc/latest/**.

However, results from https://ballerina.io**/** are still appearing in the search results for the term grpc, and results from the prioritized https://lib.ballerina.io/ballerina/grpc/latest/** are not appearing at all, as shown below.

[Screenshot 2022-01-07 at 12 55 49]

The URL is reported as ignored when tested, as shown below.

[Screenshot 2022-01-07 at 12 57 33]

What is wrong here?

praneesha avatar Jan 07 '22 07:01 praneesha

The ranking seems fine in your screenshot: pages with a pageRank of 4 are ranked higher than pages with a pageRank of 1. A higher pageRank places results before pages with a lower (or no) pageRank; see https://docsearch.algolia.com/docs/record-extractor#boosting-search-results-with-pagerank
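As a rough sketch of the mechanism (field placement simplified here for illustration: in a real config, pageRank sits inside the recordProps of each action's recordExtractor), the action with the higher value wins:

```javascript
// Hypothetical sketch: records produced by the action with the higher
// pageRank are ranked before records from the lower-ranked action.
// pageRank values are strings, matching the config earlier in the thread.
const grpcLatestAction = {
  pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  pageRank: "4", // boosted: shown first for matching queries
};
const siteWideAction = {
  pathsToMatch: ["https://ballerina.io**/**"],
  pageRank: "1", // ranked after the boosted pages
};
```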

It says the URL is ignored when tested as shown below.

You passed the ** glob, so it returns a 404, since that literal URL does not exist on your website.

shortcuts avatar Jan 07 '22 08:01 shortcuts

@shortcuts - The ** was used as a wildcard to crawl all the URLs with path segments appended after .../latest/ in https://lib.ballerina.io/ballerina/grpc/latest/, for example, https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller.

Is it incorrect to use the glob like that? If so, how do we crawl the content in all those nested URLs?

However, the URL is still ignored when tested, even though I removed the glob, as shown below.

[Screenshot 2022-01-10 at 11 03 13]

praneesha avatar Jan 10 '22 05:01 praneesha

@shortcuts - Any update on the above?

praneesha avatar Jan 11 '22 07:01 praneesha

Is it incorrect to use the glob like that? If so how to crawl content in all those nested URLs?

Hey, this is indeed correct in the config, but the URL tester needs a direct URL.

(For the screenshot: redirect means the URL found redirects to another one, so we skipped crawling it.)

If any URLs are not crawled, you should check the Monitoring section to see the reason. This FAQ could also help you!

shortcuts avatar Jan 11 '22 08:01 shortcuts

@shortcuts - Thanks for the response. So does that mean we cannot crawl URLs that redirect to another one? In that case, do we need to put the final (redirect target) URL in the config?

praneesha avatar Jan 11 '22 08:01 praneesha

@shortcuts - I tried a direct URL that does not have any redirection associated with it, but it also gets ignored.

We do have URLs like this on the website that match this pattern, and I am not sure why they get ignored in the crawl.

https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType

[Screenshot 2022-01-12 at 15 57 44]

Can you please help figure out the reason?

praneesha avatar Jan 12 '22 10:01 praneesha

So, does that mean we cannot crawl URLs that are being redirected to another?

If we find both URLs, we will only crawl the one that does not redirect.

tried the direct URL, which does not have any redirection associated with it but still that also gets ignored.

As per https://github.com/algolia/docsearch-configs/issues/5000#issuecomment-1009692137, the URL tester only accepts direct URLs, which means you can't use globs in it. Globs are used in the config, for pathsToMatch etc.

So if you try with the direct URL, you will see this (see screenshot): multiple actions/pathsToMatch match this URL, which creates duplicate records (L37, L94). You need to use negative patterns (see L95) to avoid this issue.

[Screenshot 2022-01-12 at 11 33 41]
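Putting the thread together, a hedged sketch of how the two actions could share the index without any URL matching both (globs taken from earlier messages; recordExtractors omitted):

```javascript
// Hypothetical summary of the fix discussed above: the dedicated action
// owns the gRPC "latest" pages, and the broad action negates that glob,
// so no URL produces records in both actions.
const actions = [
  {
    indexName: "ballerina",
    pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
    // recordExtractor with pageRank "4", as shown earlier in the thread
  },
  {
    indexName: "ballerina",
    pathsToMatch: [
      "https://lib.ballerina.io**/**",
      "!https://lib.ballerina.io/ballerina/grpc/latest/**", // negative pattern
    ],
    // recordExtractor for the rest of the site
  },
];
```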

shortcuts avatar Jan 12 '22 10:01 shortcuts