Anchors are being stripped out (using `sitemaps`, `linkExtractor` and `externalData`)
Description
We are using the Algolia Crawler UI to crawl our mixed static HTML & SPA website (which uses a hash router). All URLs are provided through the `sitemaps` Crawler config.
new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // ...
})
Steps to reproduce
Use a sitemap with the following content:
<!-- ... -->
<url>
  <loc>https://example.com/page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/foo</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/bar</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<!-- ... -->
...or use a static `linkExtractor`:
new Crawler({
  // ...
  linkExtractor: () => {
    return [
      "https://example.com/page.html",
      "https://example.com/subpage.html#/foo",
      "https://example.com/subpage.html#/bar",
    ];
  },
  // ...
})
Then run the URL Tester.
Result:
LINKS
Found 2 links matching your configuration
- https://example.com/page.html
- https://example.com/subpage.html
Expected behavior
Expected result:
LINKS
Found 3 links matching your configuration
- https://example.com/page.html
- https://example.com/subpage.html#/foo
- https://example.com/subpage.html#/bar
Note that these are not section anchors; they are actual pages, and the URL Tester parses them correctly with the renderJavaScript: true option when the full URL including the hash is passed.
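For reference, a minimal sketch of how we enable JavaScript rendering in the config (assuming the boolean form of the option; the rest of our config is omitted):

new Crawler({
  // ...
  // Render pages with a headless browser so the hash-routed SPA content is visible
  renderJavaScript: true,
  // ...
})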
Environment
- Algolia Crawler UI
Similar issues:
- https://github.com/algolia/docsearch/issues/1282
- https://github.com/algolia/docsearch/issues/1823
- https://github.com/algolia/docsearch/issues/1009 (old infra)
- https://github.com/algolia/docsearch/issues/53 (very old)
Hey, thanks for opening the issue. https://github.com/algolia/docsearch/issues/1823 seems related.
I'll investigate whether there's a way for us to differentiate hash-routed pages from anchored sections.
Thank you for the quick response!
Just for clarity: we don't mind adding or implementing a custom linkExtractor or recordExtractor with a custom objectID. We just need those URLs to be accepted (crawling works as intended when the crawl is run manually from the UI).
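For example, something along these lines would be fine on our side (just a rough sketch; I'm assuming the recordExtractor receives the page url as a URL object and a Cheerio-like $ helper, and the exact parameter names and selectors are placeholders):

new Crawler({
  // ...
  recordExtractor: ({ url, $ }) => {
    // Keep the hash so each SPA route becomes its own record
    const fullUrl = url.href; // e.g. https://example.com/subpage.html#/foo
    return [
      {
        objectID: fullUrl, // custom objectID including the anchor
        url: fullUrl,
        title: $("h1").first().text(),
        content: $("main").text().trim(),
      },
    ];
  },
  // ...
})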
Hey @shortcuts, any news on this one?
Somewhat related: I tried providing anchored URLs to the Crawler with externalData: ['myCSV'], as described in your docs, and those URLs were again stripped down to one.
Example CSV:
url;title;content
"https://example.com/subpage.html#/foo";"Foo";"Foo content"
"https://example.com/subpage.html#/bar";"Bar";"Bar content"
Only a single URL appears under Crawler admin > External Data: https://example.com/subpage.html
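For completeness, this is roughly the setup I tried (a sketch; the CSV is registered as an external data source named myCSV in the Crawler admin, and I'm assuming rows are matched to the crawled URL and exposed on dataSources as described in your docs):

new Crawler({
  // ...
  externalData: ["myCSV"],
  recordExtractor: ({ url, dataSources }) => {
    // External data matched to the current URL (shape assumed from the docs)
    const external = dataSources.myCSV || {};
    return [
      {
        objectID: url.href, // keep the anchored URL as the objectID
        url: url.href,
        ...external, // title and content columns from the CSV row
      },
    ];
  },
  // ...
})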
I would expect the same issue to appear with your JS API client, but I've just successfully created 2 objects containing URLs with anchors in our demo app (free plan, app ID BZSKX72NEG). However, I was not able to create an admin API key for our app (DOCSEARCH plan, app ID J1Y01X9HGM) because the "All API Keys" section/tab is missing. Using the Admin API key, I received error 400 - Not enough rights to update an object near line:1.
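This is roughly what I pushed to the demo app with the JS client (assuming algoliasearch v4; the index name and record fields are just placeholders):

const algoliasearch = require("algoliasearch");

const client = algoliasearch("BZSKX72NEG", "<admin-api-key>");
const index = client.initIndex("demo_anchored_urls");

// Two separate records whose URLs differ only by the hash fragment
index
  .saveObjects([
    { objectID: "https://example.com/subpage.html#/foo", url: "https://example.com/subpage.html#/foo", title: "Foo" },
    { objectID: "https://example.com/subpage.html#/bar", url: "https://example.com/subpage.html#/bar", title: "Bar" },
  ])
  .then(({ objectIDs }) => console.log(objectIDs));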
So my wild guess would be that your system supports anchored URLs; they are just not supported by the Crawler?
Hey @shortcuts, any news about this one?