Trouble unzipping `sitemap.xml` (`zlib: incorrect header check`)
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/core
Issue description
gzip is not able to unzip some sitemaps properly, and the `Sitemap.load()` call ends with a `Malformed sitemap content` error.
Code sample
```ts
import { Sitemap } from 'crawlee';

// Loading the sitemap in a browser works OK.
const sitemap = await Sitemap.load('https://www.paypal-community.com/sitemap.xml');
```
Package version
3.9.2
Node.js version
20
Operating system
Linux
Apify platform
- [X] Tick me if you encountered this issue on the Apify platform
I have tested this on the next release
No response
Other context
No response
Looks like the sitemap you reference points to another one - https://www.paypal-community.com/sitemap_threads.xml.gz - which is not gzipped, despite its name. Why anyone would do this is beyond me. I guess we could add an option to `Sitemap.load` to override the type :shrug:... Trying to recover automatically seems futile to me.
EDIT: the server uses in-transit gzip encoding, which is however a different thing from what the `.xml.gz` extension is supposed to mean.
Can't we detect this based on some initial bytes? Hard to trust extensions for anything these days :D
edit: something like this - we could check for the gzip format by comparing the first two bytes of the content with the expected magic bytes (`0x1F`, `0x8B`); gzip files always start with these two bytes.
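Roughly, a sketch of what that could look like on a Node stream (hypothetical helper, not existing crawlee code): peek at the buffered bytes, push them back, and only pipe through zlib when the magic bytes match.

```ts
import { Readable } from 'node:stream';
import { createGunzip } from 'node:zlib';

// Hypothetical helper, not part of crawlee: decide whether to gunzip a response
// body by sniffing the gzip magic bytes (0x1F 0x8B) instead of trusting the URL.
async function maybeDecompress(body: Readable): Promise<Readable> {
    // Wait until some bytes are buffered, then peek at them.
    const chunk: Buffer = await new Promise((resolve, reject) => {
        body.once('readable', () => resolve(body.read() ?? Buffer.alloc(0)));
        body.once('error', reject);
    });

    // Put the sniffed bytes back so downstream consumers still see the whole body.
    if (chunk.length > 0) body.unshift(chunk);

    const isGzip = chunk.length >= 2 && chunk[0] === 0x1f && chunk[1] === 0x8b;
    return isGzip ? body.pipe(createGunzip()) : body;
}
```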
> Can't we detect this based on some initial bytes? Hard to trust extensions for anything these days :D
It's gonna be tricky since we use streams, but feasible, I guess.
You can use `file-type` (example with streams here); in WCC it worked pretty well. You can even pipe through it 👀
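Something along these lines, perhaps (assuming `file-type`'s `fileTypeStream()` helper for Node streams; the exact import path differs between `file-type` versions, and `decompressIfGzipped` is just an illustrative name):

```ts
import { fileTypeStream } from 'file-type';
import { createGunzip } from 'node:zlib';
import type { Readable } from 'node:stream';

// Illustrative only: wrap the body so file-type can sample its first bytes,
// then gunzip only when it actually detects a gzip payload.
async function decompressIfGzipped(body: Readable): Promise<Readable> {
    const wrapped = await fileTypeStream(body);

    // fileType is undefined for plain text/XML; real gzip reports application/gzip.
    if (wrapped.fileType?.mime === 'application/gzip') {
        return wrapped.pipe(createGunzip());
    }
    return wrapped;
}
```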
More WCC users are complaining about this. Do we know how to approach this issue yet?