Anchors are being stripped out (using `sitemaps`, `linkExtractor` and `externalData`)

Open bojanrajh opened this issue 2 years ago • 4 comments

Description

We are using the Algolia Crawler UI to parse our mixed static HTML & SPA website (which uses a hash router). All URLs are provided via the sitemaps Crawler config option.

new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // ...
})

Steps to reproduce

Use a sitemap with the following content:

<!-- ... -->
<url>
  <loc>https://example.com/page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/foo</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/bar</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<!-- ... -->

... or use a static linkExtractor:

new Crawler({
  // ...
  linkExtractor: () => {
    return [
      "https://example.com/page.html",
      "https://example.com/subpage.html#/foo",
      "https://example.com/subpage.html#/bar",
    ];
  },
  // ...
})

Then run the URL Tester.

Result:

LINKS
Found 2 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html

Expected behavior

Expected result:

LINKS
Found 3 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html#/foo
 - https://example.com/subpage.html#/bar

Note that these are not section anchors: they are actual pages, which the URL Tester parses correctly with the renderJavaScript: true option when the full URL, anchor included, is passed.
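
For context, a minimal sketch of the relevant part of our config, assuming only the documented renderJavaScript option (everything else is elided):

new Crawler({
  // ...
  // hash-routed subpages only produce their content when rendered with JS
  renderJavaScript: true,
  // ...
})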

Environment

  • Algolia Crawler UI

Similar issues:

  • https://github.com/algolia/docsearch/issues/1282
  • https://github.com/algolia/docsearch/issues/1823
  • https://github.com/algolia/docsearch/issues/1009 (old infra)
  • https://github.com/algolia/docsearch/issues/53 (very old)

bojanrajh avatar Mar 21 '23 11:03 bojanrajh

Hey, thanks for opening the issue. https://github.com/algolia/docsearch/issues/1823 seems related.

I'll investigate whether there's a way for us to differentiate hash-routed pages from anchored sections.

shortcuts avatar Mar 21 '23 11:03 shortcuts

Thank you for the quick response! For clarity: we don't mind implementing a custom linkExtractor or recordExtractor with a custom objectID, as sketched below. We just need those URLs to be accepted (crawling works as intended when the crawl is run manually from the UI).
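
For illustration, a rough sketch of the kind of recordExtractor we would be happy to maintain ourselves, assuming the anchored URL were passed through to it; the index name and the record fields other than objectID are placeholders:

new Crawler({
  // ...
  actions: [
    {
      indexName: "my_index", // placeholder index name
      pathsToMatch: ["https://example.com/**"],
      recordExtractor: ({ url, $ }) => {
        // keep the hash fragment so each SPA route gets its own record
        return [
          {
            objectID: url.href,
            url: url.href,
            title: $("title").text(),
          },
        ];
      },
    },
  ],
})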

bojanrajh avatar Mar 21 '23 11:03 bojanrajh

Hey @shortcuts, any news on this one?

Somewhat related: I tried to provide anchored URLs to the Crawler with externalData: ['myCSV'], as described in your docs, and those URLs were again stripped of their anchors and collapsed into one.

Example CSV:

url;title;content
"https://example.com/subpage.html#/foo";"Foo";"Foo content"
"https://example.com/subpage.html#/bar";"Bar";"Bar content"

Result: only a single URL appears under Crawler admin > External Data: https://example.com/subpage.html
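
For reference, the wiring I tried was along these lines, assuming the dataSources pattern from your externalData docs (action details elided):

new Crawler({
  // ...
  externalData: ["myCSV"],
  actions: [
    {
      // ...
      recordExtractor: ({ url, dataSources }) => {
        // dataSources.myCSV holds the CSV row matched to the current URL;
        // with the anchors stripped, both rows end up matched against
        // https://example.com/subpage.html
        return [
          {
            objectID: url.href,
            title: dataSources.myCSV.title,
            content: dataSources.myCSV.content,
          },
        ];
      },
    },
  ],
})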

I expected the same issue to appear with your JS API client, but I've just successfully created two objects containing URLs with anchors in our demo app (free plan, app ID BZSKX72NEG). However, I was not able to create an admin API key for our app (DOCSEARCH plan, app ID J1Y01X9HGM) because the "All API Keys" section/tab is missing, and using the Admin API key there returned error 400 - Not enough rights to update an object near line:1.
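
What I ran against the demo app was essentially this (v4 JavaScript client; the API key and index name are placeholders):

const algoliasearch = require("algoliasearch");

// demo app (free plan); the key below stands in for a real admin API key
const client = algoliasearch("BZSKX72NEG", "YOUR_ADMIN_API_KEY");
const index = client.initIndex("demo_pages"); // placeholder index name

// both records keep the hash fragment in their URL and are accepted as-is
index
  .saveObjects([
    { objectID: "foo", url: "https://example.com/subpage.html#/foo", title: "Foo" },
    { objectID: "bar", url: "https://example.com/subpage.html#/bar", title: "Bar" },
  ])
  .then(({ objectIDs }) => console.log("Saved:", objectIDs));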

So, technically, my wild guess would be that your system supports anchored URLs; they are just not accepted by the Crawler?

bojanrajh avatar Mar 29 '23 14:03 bojanrajh

Hey @shortcuts, any news on this one?

bojanrajh avatar Oct 16 '23 08:10 bojanrajh