Set a more permissive `Accept:` header for sitemap loading

Open · barjin opened this issue 1 month ago · 3 comments

Some sitemaps (e.g. https://docs.superjoin.ai/sitemap.xml) return 404 for the value of the `Accept:` HTTP header currently sent by `utils/sitemap`.

https://github.com/apify/crawlee/blob/615c8f9f691fab70d15be84c2ccff29daab4e55e/packages/utils/src/internals/sitemap.ts#L261

Investigate why this was required in the first place (maybe it actually wasn't). If it wasn't, switch the `Accept` header to something more permissive (e.g. `*/*`) or remove it completely.
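
For illustration only, here is a minimal sketch (not the actual `utils/sitemap` implementation) of what a more permissive fetch could look like with got-scraping; the helper name and error handling are made up for this example:

```ts
import { gotScraping } from 'got-scraping';

// Hypothetical helper, not the code in packages/utils/src/internals/sitemap.ts.
// Sends a maximally permissive Accept header so servers that 404 on specific
// Accept values (like the one above) still serve the sitemap.
async function fetchSitemap(url: string): Promise<string> {
    const { body, statusCode } = await gotScraping({
        url,
        headers: {
            // '*/*' accepts anything; alternatively, omit the header entirely
            // and let got-scraping generate its default browser-like value.
            accept: '*/*',
        },
        throwHttpErrors: false,
    });

    if (statusCode !== 200) {
        throw new Error(`Failed to load sitemap ${url}: HTTP ${statusCode}`);
    }

    return body;
}
```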

Additional context:

  • https://apify.slack.com/archives/C051BSK664C/p1762868702710649
  • https://apify.slack.com/archives/C05683VTD6J/p1762891856864289

barjin · Nov 12 '25 09:11

> https://github.com/apify/crawlee/blob/615c8f9f691fab70d15be84c2ccff29daab4e55e/packages/utils/src/internals/sitemap.ts#L261
>
> Investigate why this was required in the first place (maybe it actually wasn't). If it wasn't, switch the `Accept` header to something more permissive (e.g. `*/*`) or remove it completely.

Here's your answer 🙂 https://github.com/apify/crawlee/pull/2619#discussion_r1718356374

janbuchar · Nov 12 '25 10:11

Thank you - I did find that, but it still left me wondering why we would add it in the first place (instead of just using got-scraping's default Accept).

barjin · Nov 12 '25 10:11

> Thank you - I did find that, but it still left me wondering why we would add it in the first place (instead of just using got-scraping's default Accept).

I see. IMO even an empty `Accept` header should be fine. No one in their right mind would block access to a sitemap based on browser fingerprinting. Then again, what the aforementioned server is doing is not exactly sane either.
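
If anyone wants to verify, a throwaway probe like the following (not part of crawlee; the list of header values is just an example) shows how the server responds to different `Accept` values:

```ts
import { gotScraping } from 'got-scraping';

// Throwaway check: compare status codes for a few Accept header values
// against the sitemap mentioned above.
async function probeAcceptHeaders(url: string): Promise<void> {
    const acceptValues = ['*/*', 'application/xml, text/xml', 'text/html', undefined];

    for (const accept of acceptValues) {
        const { statusCode } = await gotScraping({
            url,
            // With no explicit header, got-scraping generates its default value.
            headers: accept ? { accept } : {},
            throwHttpErrors: false, // we want to see the 404s, not throw on them
        });
        console.log(`Accept: ${accept ?? '(got-scraping default)'} -> HTTP ${statusCode}`);
    }
}

await probeAcceptHeaders('https://docs.superjoin.ai/sitemap.xml');
```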

janbuchar · Nov 12 '25 10:11