Goutte
Enforce or detect UTF-8 encoding when 'charset' is not set
I am using Goutte to scrape a couple of sites. A few of them serve UTF-8 content but only send "text/html" as the Content-Type, with no charset. The DomCrawler then assumes ISO-8859-1, which results in double-encoded UTF-8 strings in the returned DOMDocument (and in the results of text() and so on).
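To illustrate what "double-encoded" means here, a minimal sketch that does not use Goutte at all. It assumes the mbstring extension is available and only reproduces the effect of reading UTF-8 bytes as ISO-8859-1; it is not the DomCrawler's actual code path.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Treating those bytes as ISO-8859-1 and converting them to UTF-8 again
// turns each byte into its own character: "Ã" followed by "©".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";          // c3a9
echo bin2hex($doubleEncoded), "\n"; // c383c2a9
```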
Right now I am working around this by extending Goutte\Client and overriding createCrawlerFromContent so that it calls the parent method with ";charset=UTF-8" appended to the type when there is no charset attribute. That is probably not a very good way to do it, so I didn't want to make a pull request just yet.
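Roughly, the workaround looks like the sketch below. The names withDefaultCharset and Utf8DefaultClient are made up for this example, and the method signature is an assumption based on Symfony BrowserKit (the return type needs PHP 7.1+); it may need adjusting for the installed version.

```php
<?php
// Hypothetical helper (not part of Goutte): append a default charset to a
// Content-Type value only when the server did not send one.
function withDefaultCharset($type, $charset = 'UTF-8')
{
    $type = (string) $type;
    if ($type === '' || stripos($type, 'charset=') !== false) {
        // Nothing to extend, or a charset is already declared: leave as-is.
        return $type;
    }
    return $type . ';charset=' . $charset;
}

// The subclass is only declared when Goutte is installed and autoloadable.
if (class_exists('Goutte\Client')) {
    class Utf8DefaultClient extends \Goutte\Client
    {
        // Assumption: the parent signature is ($uri, $content, $type), as in
        // Symfony BrowserKit; adjust the type hints to match your version.
        protected function createCrawlerFromContent($uri, $content, $type): ?\Symfony\Component\DomCrawler\Crawler
        {
            return parent::createCrawlerFromContent($uri, $content, withDefaultCharset($type));
        }
    }
}
```

With this in place, Utf8DefaultClient would be used wherever Goutte\Client was used before. Responses that already declare a charset are passed through unchanged.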
My main point is that this took me quite a while to figure out. Goutte could probably be more convenient, and save other new users from falling into the same trap, by letting users specify an encoding. Besides that, thanks for a great library!
+1 :)
+1
+1
+1
+1
+1
@aleksblendwerk, this could be what you're after:
- allow specifying default charset in Goutte
- allow specifying default charset in DomCrawler
- pass through default charset from Goutte to DomCrawler
P.S. Maybe Goutte/DomCrawler can already do that and I just don't know what the relevant settings are called.
+1
+1
+1