Enforce or detect UTF-8 encoding when 'charset' is not set

Open aleksblendwerk opened this issue 13 years ago • 10 comments

I am using Goutte to scrape a couple of sites. A few of them serve UTF-8 content but send only "text/html" as the Content-Type, with no charset. DomCrawler then assumes the content is ISO-8859-1, which results in double-encoded UTF-8 strings in the returned DOMDocument (and in the results of text() and so on).
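To illustrate the double encoding, here is a minimal sketch. It assumes the mbstring extension is available, and it does not touch Goutte or DomCrawler at all; it only imitates what a conversion from a wrongly assumed ISO-8859-1 charset does to bytes that were already UTF-8:

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Treat those bytes as ISO-8859-1 and convert them to UTF-8.
// Each byte is read as its own character ("Ã" and "©") and re-encoded.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";          // c3a9
echo bin2hex($doubleEncoded), "\n"; // c383c2a9
```

The second value is what shows up as "Ã©" instead of "é" in the scraped text.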

Right now I am working around this by extending Goutte\Client and overriding createCrawlerFromContent, calling the parent method with ";charset=UTF-8" appended to the type when there is no charset attribute. This is probably not a good way to do it, so I didn't want to open a pull request just yet.
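The core of that workaround can be sketched as a small standalone function. The name withDefaultCharset and its $default parameter are hypothetical, made up for illustration; only the idea of appending ";charset=UTF-8" when no charset is present comes from the description above:

```php
<?php
// Hypothetical helper (not part of Goutte or DomCrawler): append a default
// charset to a Content-Type value only when none is declared already.
function withDefaultCharset(string $type, string $default = 'UTF-8'): string
{
    // Case-insensitive check, since servers may send "Charset=" or "CHARSET=".
    if (stripos($type, 'charset=') !== false) {
        return $type; // a charset is already declared; leave it untouched
    }

    return $type . ';charset=' . $default;
}

echo withDefaultCharset('text/html'), "\n";
// text/html;charset=UTF-8
echo withDefaultCharset('text/html; charset=ISO-8859-15'), "\n";
// text/html; charset=ISO-8859-15
```

In the overridden createCrawlerFromContent, the result of such a function would be passed as the type argument to the parent method. The exact method signature depends on the installed version, so it is left out here.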

My main point is that this took me quite a while to figure out. Goutte could probably be more convenient, and save other new users from falling into the same trap, by letting users specify an encoding. Besides that, thanks for a great library!

aleksblendwerk avatar May 31 '12 22:05 aleksblendwerk

+1 :)

mashpie avatar Jun 05 '12 19:06 mashpie

+1

olragon avatar Aug 12 '12 05:08 olragon

+1

akbortoli avatar Mar 05 '13 10:03 akbortoli

+1

neochief avatar Mar 18 '13 12:03 neochief

+1

RageZBla avatar Aug 20 '13 08:08 RageZBla

+1

abardan avatar Feb 23 '16 08:02 abardan

@aleksblendwerk, this may be what you're after:

  1. allow specifying default charset in Goutte
  2. allow specifying default charset in DomCrawler
  3. pass through default charset from Goutte to DomCrawler

P.S. Maybe Goutte/DomCrawler can already do that and I'm just not aware of what the relevant settings are called.

aik099 avatar Feb 23 '16 09:02 aik099

+1

ux-engineer avatar Nov 28 '16 16:11 ux-engineer

+1

Oxicode avatar Nov 30 '16 16:11 Oxicode

+1

legshooter avatar May 17 '17 20:05 legshooter