Trouble unzipping `sitemap.xml` (`zlib: incorrect header check`)

Open barjin opened this issue 1 year ago • 5 comments

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

gzip is not able to unzip some sitemaps properly, and the Sitemap.load() call ends with a "Malformed sitemap content" error.

Code sample

import { Sitemap } from 'crawlee';

// loading the sitemap in browser works ok.
const sitemap = await Sitemap.load('https://www.paypal-community.com/sitemap.xml');

Package version

3.9.2

Node.js version

20

Operating system

Linux

Apify platform

  • [X] Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

barjin avatar May 14 '24 09:05 barjin

Looks like the sitemap you reference points to another one - https://www.paypal-community.com/sitemap_threads.xml.gz - which is not gzipped, despite its name. Why anyone would do this is beyond me. I guess we could add an option to Sitemap.load to override the type :shrug:... Trying to recover automatically seems futile to me.

EDIT: the server uses in-transit gzip encoding, which is, however, a different thing from what the .xml.gz extension is supposed to mean.

janbuchar avatar May 14 '24 09:05 janbuchar

Can't we detect this based on some initial bytes? Hard to trust extensions for anything these days :D

edit: something like this

In the modified code, we are checking for the GZip file format by comparing the first 2 bytes of the file with the expected byte sequence ({&H1F, &H8B}). GZip files typically start with these two bytes.
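A minimal sketch of that two-byte check in TypeScript, using only Node's built-in `zlib` to produce test data. The helper name `looksLikeGzip` is hypothetical and not part of crawlee; it only illustrates the idea of trusting the content rather than the file extension.

```typescript
import { gzipSync } from 'node:zlib';

// Hypothetical helper (not a crawlee API): returns true when the data
// starts with the gzip magic bytes 0x1F 0x8B.
function looksLikeGzip(head: Uint8Array): boolean {
    return head.length >= 2 && head[0] === 0x1f && head[1] === 0x8b;
}

// Really gzipped content starts with the magic bytes...
const gzipped = gzipSync(Buffer.from('<urlset></urlset>'));
// ...while plain XML served under a misleading `.xml.gz` name does not.
const plain = Buffer.from('<?xml version="1.0"?><urlset></urlset>');

console.log(looksLikeGzip(gzipped), looksLikeGzip(plain));
```

The check only says that the content is probably gzip; a payload that happens to start with the same two bytes would still fail later during decompression.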

B4nan avatar May 14 '24 09:05 B4nan

> Can't we detect this based on some initial bytes? Hard to trust extensions for anything these days :D

It's gonna be tricky since we use streams, but feasible, I guess.
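One way it could work with streams, as a rough sketch: buffer chunks until at least two bytes are available, look at them, then replay the buffered bytes followed by the rest of the stream, decompressing only when the gzip magic bytes are present. This uses only Node's built-in `stream` and `zlib` modules; `maybeGunzip` is a hypothetical name and does not reflect how crawlee actually implements `Sitemap.load`.

```typescript
import { Readable } from 'node:stream';
import { createGunzip } from 'node:zlib';

// Hypothetical helper (not a crawlee API): sniffs the first two bytes of
// `source` and returns a stream of decompressed or untouched content.
async function maybeGunzip(source: Readable): Promise<Readable> {
    const iterator = source[Symbol.asyncIterator]();
    const buffered: Buffer[] = [];
    let size = 0;

    // The first chunk may be shorter than two bytes, so keep reading.
    while (size < 2) {
        const { value, done } = await iterator.next();
        if (done) break;
        const chunk = Buffer.from(value);
        buffered.push(chunk);
        size += chunk.length;
    }
    const head = Buffer.concat(buffered);

    // Replay what was consumed for sniffing, then the remaining chunks.
    async function* replay(): AsyncGenerator<Buffer> {
        if (head.length > 0) yield head;
        while (true) {
            const { value, done } = await iterator.next();
            if (done) return;
            yield Buffer.from(value);
        }
    }

    const replayed = Readable.from(replay(), { objectMode: false });
    const isGzip = head.length >= 2 && head[0] === 0x1f && head[1] === 0x8b;

    // Note: `.pipe()` does not forward errors from `replayed`; production
    // code would likely use `pipeline` from `node:stream` instead.
    return isGzip ? replayed.pipe(createGunzip()) : replayed;
}
```

The caller would pass the response body stream to `maybeGunzip` and feed the result to the XML parser, regardless of whether the URL ends in `.gz`.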

janbuchar avatar May 14 '24 09:05 janbuchar

You can use file-type (example with streams here), in WCC it worked pretty well. You can even pipe through it 👀

barjin avatar May 14 '24 09:05 barjin

[image]

More WCC users are complaining about this. Do we know how to approach this issue yet?

barjin avatar May 20 '24 11:05 barjin