Goutte

<!DOCTYPE html> breaks scraping

Open benjivm opened this issue 8 years ago • 5 comments

Hey there, I'm attempting to scrape IMDb, but the very first line of their site breaks the scraper: <!DOCTYPE html>

Saving the file locally and removing that one line fixes the issue and Goutte returns the data I want.

Update: Well, I guess it's not that tag per se, because many other sites with an identical declaration work just fine, but for some reason removing it makes IMDb work. So the site is doing something else that breaks Goutte, but I have no idea what it could be.

Update 2: OK, so I've traced it to the following: removing the DOCTYPE declaration allows the entire page to load, which means scraping works. Adding it back prevents Goutte from fetching any more content outside of <head>. When I look at the HTML loaded with the DOCTYPE in place, only the content within the <head> tags appears, making </head> the end of the file.

Update 3: Found the culprit: IMDb uses two JavaScript ads that completely overwrite the document's head property depending on the ad being loaded. As a result, there are 3 DOCTYPE declarations on the page, which breaks Goutte. Removing the

benjivm avatar Sep 14 '17 00:09 benjivm
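To check whether a page has the same problem described in Update 3, one option is to count the DOCTYPE declarations in the raw response body before parsing it. The following is a minimal, language-neutral sketch in Python rather than PHP; the function name `count_doctypes` and the sample markup are hypothetical and only imitate the situation described (one real DOCTYPE plus two embedded in ad scripts).

```python
import re

# Case-insensitive match for the start of a DOCTYPE declaration.
DOCTYPE_RE = re.compile(r"<!doctype\b", re.IGNORECASE)


def count_doctypes(html: str) -> int:
    """Return how many DOCTYPE declarations appear anywhere in the raw markup."""
    return len(DOCTYPE_RE.findall(html))


# Hypothetical page: one real DOCTYPE plus two more inside ad scripts.
sample = (
    "<!DOCTYPE html><html><head>"
    "<script>var ad1 = '<!DOCTYPE html><html><head></head></html>';</script>"
    "<script>var ad2 = '<!doctype html><html><head></head></html>';</script>"
    "</head><body><h1>Title</h1></body></html>"
)

print(count_doctypes(sample))
```

A count above 1 suggests that scripts on the page embed full documents, which is the pattern reported here.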

Hi, I have run into the same issue. Have you found a solution?

KOFLazycat avatar Dec 20 '17 09:12 KOFLazycat

@KOFLazycat Unfortunately not.

benjivm avatar Dec 20 '17 16:12 benjivm

Goutte uses the Symfony BrowserKit component, which itself relies on PHP's features to parse the HTML (and PHP itself relies on libxml). So if PHP cannot parse the HTML source of this page, there is nothing you can really do in Goutte.

stof avatar Dec 20 '17 17:12 stof

@stof Do you know of a way to force BrowserKit to ignore everything inside <script> tags?

benjivm avatar Dec 20 '17 17:12 benjivm
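The thread does not show a BrowserKit option for ignoring script contents. One workaround, assuming you can get at the raw response body before it is parsed, is to strip the <script> blocks yourself and parse the cleaned markup. The sketch below illustrates the idea in Python with the standard library only; `strip_scripts` and the sample page are hypothetical, and the same regular-expression approach should carry over to PHP.

```python
import re

# Matches an opening <script ...> tag, its contents (lazily), and the closing tag.
SCRIPT_RE = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html: str) -> str:
    """Remove every <script>...</script> block from the raw markup."""
    return SCRIPT_RE.sub("", html)


# Hypothetical page where a script embeds a second DOCTYPE and <head>.
sample = (
    "<!DOCTYPE html><html><head>"
    "<script>document.write('<!DOCTYPE html><head></head>');</script>"
    "</head><body><h1>Title</h1></body></html>"
)

cleaned = strip_scripts(sample)
print(cleaned)
```

A regular expression is a blunt tool for this: a script whose source contains the literal text `</script>` inside a string would be cut short. It also removes any data the page only exposes through JavaScript, so it helps only when the content you want is in the static HTML.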

@gmask DiDOM handles this page fine. If you just want to scrape IMDb, you can try it.

KOFLazycat avatar Dec 21 '17 06:12 KOFLazycat

For those facing this issue, I suggest using RoachPHP with the ExecuteJavascriptMiddleware, which requires Browsershot but works well.

benjivm avatar Sep 15 '22 21:09 benjivm