handleOnXML tries to parse `.xlsx` files

Open theseanything opened this issue 10 months ago • 2 comments

The handleOnXML function attempts to parse responses with the content-type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. This is because the function looks for any mention of xml in the content type. This results in a parse error when xmlquery.Parse() is called (for example: `encoding/xml.SyntaxError {Msg: "illegal character code U+0003", Line: 1}`).

XLSX files are packaged as zip archives, so they can't be parsed directly as XML.

It would be ideal not to try to parse these files, possibly by being more explicit about which content-types we consider to be XML.
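As a rough sketch of what "more explicit" could mean, the check below parses the media type and accepts only `text/xml`, `application/xml`, and types with a `+xml` suffix. It uses only the Go standard library. `isXMLMediaType` is a hypothetical helper name, not an existing colly function, and the exact set of accepted types is an assumption.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isXMLMediaType is a hypothetical helper (not part of colly).
// It reports whether a Content-Type header value names an XML media type,
// instead of matching any content type that merely contains "xml".
func isXMLMediaType(contentType string) bool {
	// ParseMediaType strips parameters such as "; charset=utf-8"
	// and lowercases the media type.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	switch mediaType {
	case "text/xml", "application/xml":
		return true
	}
	// Structured syntax suffix, e.g. application/rss+xml.
	return strings.HasSuffix(mediaType, "+xml")
}

func main() {
	for _, ct := range []string{
		"application/xml; charset=utf-8",
		"application/rss+xml",
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	} {
		fmt.Println(ct, isXMLMediaType(ct))
	}
}
```

With this check the xlsx content type is rejected because it neither equals one of the two base XML types nor ends in `+xml`, even though the string contains "xml".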

theseanything avatar Oct 19 '23 14:10 theseanything

This affects not only xlsx but also docx, pptx, and similar document types.

theseanything avatar Oct 20 '23 10:10 theseanything

To add to this, it would be nice to have more granular control over which XML is parsed. For example, we use an OnXML handler to follow links in an XML sitemap, but our site contains many SVGs (image/svg+xml) and RDFs (application/rdf+xml), which are also parsed unnecessarily.
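One possible shape for that granularity is a caller-supplied allowlist of media types. The sketch below is standard-library Go only. `xmlFilter`, `newXMLFilter`, and `allows` are hypothetical names for illustration, and nothing like this exists in colly as far as this issue describes.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// xmlFilter is a hypothetical allowlist of media types whose bodies
// should be parsed as XML. It is not part of colly.
type xmlFilter map[string]bool

// newXMLFilter builds a filter from the given media types.
func newXMLFilter(mediaTypes ...string) xmlFilter {
	f := make(xmlFilter, len(mediaTypes))
	for _, mt := range mediaTypes {
		f[strings.ToLower(mt)] = true
	}
	return f
}

// allows reports whether the Content-Type header value is on the allowlist.
func (f xmlFilter) allows(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return f[mediaType]
}

func main() {
	// Only sitemap-style XML is parsed; SVG and RDF are skipped.
	sitemapOnly := newXMLFilter("application/xml", "text/xml")
	for _, ct := range []string{
		"application/xml; charset=utf-8",
		"image/svg+xml",
		"application/rdf+xml",
	} {
		fmt.Println(ct, sitemapOnly.allows(ct))
	}
}
```

In a crawler, a check like `allows` would run on the response's Content-Type header before any XML parsing is attempted, so SVG and RDF responses are never handed to the parser.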

theseanything avatar Oct 20 '23 10:10 theseanything