docsearch-configs
Prioritise Specific URLs in the Configs JSON
We have implemented DocSearch v3. We need to prioritize a URL (i.e., https://lib.ballerina.io/ballerina/grpc/latest/**) for a particular search term (i.e., grpc), and we have edited the JSON config in the new Crawler web interface as follows.
{
  indexName: "ballerina",
  pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "",
          defaultValue: "Ballerina gRPC",
        },
        lvl1: ".content h1",
        lvl2: ".content h2",
        lvl3: ".content h3",
        lvl4: ".content h4",
        lvl5: ".content h5, .content h6",
        content: ".content p, .content li",
        site: {
          defaultValue: ["ballerina_api_docs_grpc"],
        },
        tags: {
          defaultValue: ["ballerina_api_docs_grpc"],
        },
        pageRank: "4",
      },
      indexHeadings: true,
    });
  },
},
Although we scheduled and ran a manual crawl, the search results haven't been updated. What are we missing here?
Do you want to request a feature or report a bug?
If it is a DocSearch index issue, what is the related index_name?
index_name=
What is the current behaviour?
If the current behaviour is a bug, please provide all the steps to reproduce and screenshots with context.
What is the expected behaviour?
What have you tried to solve it?
Any quick clues?
Any other feedback / questions?
Hey,
Is there another action with a broader pathsToMatch that also matches https://lib.ballerina.io/ballerina/grpc/latest/?
If so, the URL will be crawled by both actions, and one might override the other's records. You can exclude the URL by adding !https://lib.ballerina.io/ballerina/grpc/latest/** to the pathsToMatch of that other action.
@shortcuts - Thanks a lot for the quick response.
Yes, we do have another action that uses this URL. I excluded it as follows and re-ran the crawler.
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io**/**",
  "!https://lib.ballerina.io/ballerina/grpc/latest/**",
],
However, the search results are still not updated as expected. Anything else that we need to do?
Looking at your index, records are populated with the correct weight and pageRank; maybe you need to provide a higher value?
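For instance, a sketch of bumping the value (the selectors are copied from the earlier action; the pageRank value of "100" here is an arbitrary illustration, not a recommended number):

```javascript
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl1: ".content h1",
      content: ".content p, .content li",
      // A higher pageRank places these records above records produced
      // by actions with a lower or unset pageRank.
      pageRank: "100",
    },
    indexHeadings: true,
  });
},
```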
@shortcuts - I think we will have to completely exclude the old versions of a particular URL to stop them from appearing in the search results?
For example,
https://lib.ballerina.io/ballerina/http/latest/** should be included
https://lib.ballerina.io/ballerina/http/2.0.1/** should be excluded
https://lib.ballerina.io/ballerina/http/2.0.0/** should be excluded
Do we have to exclude the old versions by adding an entry for each of them as follows?
indexName: "ballerina",
pathsToMatch: [
  "https://lib.ballerina.io/ballerina/http/latest/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "!https://lib.ballerina.io/ballerina/http/2.0.0/**",
],
Exactly, as long as a URL matches a pathsToMatch entry, records will be created! You can list URLs you don't want to crawl as exclusionPatterns.
@shortcuts - Now, we have updated the exclusionPatterns as follows.
startUrls: [
  "https://ballerina.io/",
  "https://lib.ballerina.io/",
  "https://blog.ballerina.io/",
  "https://central.ballerina.io/",
],
renderJavaScript: false,
sitemaps: ["https://ballerina.io/sitemap.xml"],
exclusionPatterns: [
  "https://lib.ballerina.io/ballerina/http/2.0.1/**",
  "https://lib.ballerina.io/ballerina/http/2.0.0/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.2/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-beta.1/**",
  "https://lib.ballerina.io/ballerina/http/1.1.0-alpha8/**",
],
However, we still get these excluded URLs in the search results as shown below. Anything we have missed here?

You can test them directly in the URL tester (crawler -> editor -> right side tab URL tester) to see if they are excluded. If not, they are crawled.
What are the URLs?
@shortcuts - I have given the URL https://ballerina.io**/** a lower priority than https://lib.ballerina.io/ballerina/grpc/latest/**.
However, results from https://ballerina.io**/** still appear in the search results for the search term grpc, and results from the prioritized https://lib.ballerina.io/ballerina/grpc/latest/** are not appearing at all, as shown below.

It says the URL is ignored when tested as shown below.

What is wrong here?
The ranking seems fine in your screenshot; pages with a pageRank of 4 rank higher than pages with a pageRank of 1. A higher pageRank will place results before pages with a lower or no pageRank, see https://docsearch.algolia.com/docs/record-extractor#boosting-search-results-with-pagerank
It says the URL is ignored when tested as shown below.
You passed the ** glob, so it says 404, since that URL does not exist on your website.
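To illustrate why the glob matches nested pages but is not itself a page, here is a toy sketch of ** semantics (this is not the Crawler's actual matcher, just a hand-rolled approximation for checking which URLs a pathsToMatch entry would cover):

```javascript
// Toy illustration: convert a glob containing "**" into a RegExp.
function globToRegExp(glob) {
  // Escape regex metacharacters (except "*"), then turn "**" into ".*".
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*\*/g, ".*") + "$");
}

const pattern = globToRegExp("https://lib.ballerina.io/ballerina/grpc/latest/**");

// Nested pages under latest/ match the pattern...
console.log(pattern.test("https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller")); // true
// ...while pages under other versions do not.
console.log(pattern.test("https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType")); // false
```

The literal string with ** appended is not a real page on the site, which is why fetching it in the URL tester returns a 404.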
@shortcuts - The ** was used as a wildcard to crawl all the URLs with path segments appended after .../latest/ in https://lib.ballerina.io/ballerina/grpc/latest/. For example, https://lib.ballerina.io/ballerina/grpc/latest/clients/Caller.
Is it incorrect to use the glob like that? If so, how do we crawl the content in all those nested URLs?
However, the URL is still ignored when tested, even though I removed the glob, as shown below.

@shortcuts - Any update on the above?
Is it incorrect to use the glob like that? If so, how do we crawl the content in all those nested URLs?
Hey, this is indeed correct for the config, but the URL tester needs a direct URL.
(For the screenshot: redirect means that the URL found redirects to another one, so we skipped the crawl.)
If there's any URLs that are not crawled, you should check the Monitoring section to see the reason. This FAQ could also help you!
@shortcuts - Thanks for the response. So, does that mean we cannot crawl URLs that are redirected to another? In that case, do we need to give the redirect's target URL in the config?
@shortcuts - I tried a direct URL, which does not have any redirection associated with it, but it still gets ignored.
We do have URLs like this on the website, which match this pattern, and I am not sure why they get ignored in the crawl.
https://lib.ballerina.io/ballerina/grpc/1.1.1/enums/CertValidationType

Can you please help us figure out the reason?
So, does that mean we cannot crawl URLs that are redirected to another?
If we find both URLs, we will only crawl the one that does not redirect.
I tried a direct URL, which does not have any redirection associated with it, but it still gets ignored.
As per https://github.com/algolia/docsearch-configs/issues/5000#issuecomment-1009692137, the URL tester only accepts direct URLs, which means you can't use globs in it. Globs are used in the config for pathsToMatch, etc.
So if you try with the direct URL, you will see this (see screenshot), which means that you have multiple actions/pathsToMatch matching this URL, which creates duplicate records (L37, L94). You need to use negative patterns (see L95) to avoid this issue.

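Putting the negative patterns together, a consolidated sketch of keeping the two actions disjoint (paths and selectors are illustrative, and the record extractor bodies are elided) so that each URL is matched by exactly one action and no duplicate records are created:

```javascript
actions: [
  {
    indexName: "ballerina",
    pathsToMatch: [
      "https://lib.ballerina.io/**",
      // Negative pattern: hand these pages to the dedicated action below.
      "!https://lib.ballerina.io/ballerina/grpc/latest/**",
    ],
    recordExtractor: ({ $, helpers }) => helpers.docsearch({ /* ... */ }),
  },
  {
    indexName: "ballerina",
    pathsToMatch: ["https://lib.ballerina.io/ballerina/grpc/latest/**"],
    recordExtractor: ({ $, helpers }) =>
      helpers.docsearch({
        recordProps: {
          lvl1: ".content h1",
          content: ".content p, .content li",
          // Only this action boosts the gRPC "latest" docs.
          pageRank: "4",
        },
        indexHeadings: true,
      }),
  },
],
```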