Set a more permissive `Accept:` header for sitemap loading
Some sitemaps (e.g. https://docs.superjoin.ai/sitemap.xml) return 404 when requested with the `Accept:` HTTP header value that the sitemap loader in `utils` currently sends.
https://github.com/apify/crawlee/blob/615c8f9f691fab70d15be84c2ccff29daab4e55e/packages/utils/src/internals/sitemap.ts#L261
Investigate why this was required in the first place (maybe it actually wasn't). If it wasn't, switch the `Accept` header to something more permissive (e.g. `*/*`) or remove it completely.
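For illustration, a minimal sketch of what the more permissive variant could look like (this is not the actual `sitemap.ts` code; it just assumes got-scraping's got-compatible options):

```ts
// Illustrative sketch only, not the actual sitemap.ts implementation.
// Fetches a sitemap with a fully permissive Accept header so that servers
// which 404 on more specific Accept values still respond.
import { gotScraping } from 'got-scraping';

async function fetchSitemap(url: string): Promise<string> {
    const response = await gotScraping({
        url,
        headers: {
            // '*/*' instead of a restrictive XML-specific value; alternatively,
            // drop the header entirely and rely on got-scraping's defaults.
            accept: '*/*',
        },
    });

    return response.body;
}

// e.g. const xml = await fetchSitemap('https://docs.superjoin.ai/sitemap.xml');
```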
Additional context:
- https://apify.slack.com/archives/C051BSK664C/p1762868702710649
- https://apify.slack.com/archives/C05683VTD6J/p1762891856864289
Here's your answer 🙂 https://github.com/apify/crawlee/pull/2619#discussion_r1718356374
Thank you - I did find that, but it still left me wondering why we would add it in the first place (instead of just using got-scraping's default Accept).
I see. IMO even an empty `Accept` header should be fine. No one in their right mind would block access to a sitemap based on browser fingerprinting. Then again, what the aforementioned server is doing is not exactly sane either.
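If anyone wants to double-check that the 404 really is driven by the `Accept` value, a quick illustrative probe along these lines should settle it (again assuming got-scraping's got-compatible options):

```ts
// Illustrative probe: request the same sitemap with several Accept values and
// compare the status codes, to confirm the server keys its 404 off Accept.
import { gotScraping } from 'got-scraping';

const sitemapUrl = 'https://docs.superjoin.ai/sitemap.xml';

for (const accept of ['*/*', 'application/xml', 'text/html']) {
    const { statusCode } = await gotScraping({
        url: sitemapUrl,
        headers: { accept },
        throwHttpErrors: false, // keep going on 4xx so we can log the status
    });
    console.log(`Accept: ${accept} -> HTTP ${statusCode}`);
}
```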