Goutte
<!DOCTYPE html> breaks scraping
Hey there, I'm attempting to scrape IMDb, but the very first line of their site breaks the scraper: <!DOCTYPE html>
Saving the file locally and removing that one line fixes the issue and Goutte returns the data I want.
Update: Well, I guess it's not that tag per se, because many other sites with an identical declaration work just fine, but for some reason removing it makes IMDb work. So there's something else the site is doing that breaks Goutte, but I have no idea what it could be.
Update 2
Ok, so I've traced it to the following: removing the DOCTYPE declaration allows the entire page to load, which means scraping works. Adding it back prevents Goutte from fetching any more content outside of <head>. When I look at the loaded HTML with the proper HTML and DOCTYPE in place, only what is inside the <head> tags appears, making </head> the end of the file.
Update 3: Found the culprit. IMDb uses two JavaScript ads that completely overwrite the document's head property, depending on the ad being loaded. As a result, there are three DOCTYPE declarations on the page, which breaks Goutte. Removing the
Hi, I have run into the same issue. Have you found a solution?
@KOFLazycat Unfortunately not.
Goutte uses the Symfony BrowserKit component, which itself relies on PHP's features to parse the HTML (and PHP itself relies on libxml). So if PHP cannot parse the HTML source of this page, there is nothing you can really do in Goutte.
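To check whether libxml is what chokes on a given page, one option is to feed the raw HTML to PHP's DOMDocument directly and list what libxml reports. This is only a diagnostic sketch: `collectParseErrors` is a hypothetical helper name, the sample markup is made up, and the messages you get depend on the libxml version installed.

```php
<?php
declare(strict_types=1);

/**
 * Parse HTML with DOMDocument and return the messages libxml reported.
 * collectParseErrors is a hypothetical helper, not part of Goutte or Symfony.
 */
function collectParseErrors(string $html): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Reset libxml state so later parsing is not affected.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Made-up sample; replace with the HTML saved from the page being scraped.
$html = '<!DOCTYPE html><html><head><title>t</title></head><body><p>text</p></body></html>';

foreach (collectParseErrors($html) as $message) {
    echo $message, PHP_EOL;
}
```

If the saved IMDb page produces errors here too, the problem sits below Goutte, in line with the explanation above.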
@stof Do you know of a way to force BrowserKit to ignore everything inside <script> tags?
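One possible workaround for the question above, sketched under assumptions: nothing in this thread shows a BrowserKit option for skipping <script> content, but the raw HTML can be cleaned before it reaches the parser. The helper names `stripScriptBlocks` and `loadWithoutScripts` are hypothetical, the sample markup is made up, and the regular expression is a heuristic, not a real HTML parser.

```php
<?php
declare(strict_types=1);

/**
 * Remove every <script>...</script> block from raw HTML.
 * Hypothetical helper; regex-based, so unusual markup may slip through.
 */
function stripScriptBlocks(string $html): string
{
    // "i" makes the match case-insensitive, "s" lets "." span newlines,
    // and ".*?" stops at the first closing tag.
    $stripped = preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);

    // preg_replace returns null on failure; fall back to the input.
    return $stripped ?? $html;
}

/**
 * Parse HTML with DOMDocument after the script blocks have been removed.
 */
function loadWithoutScripts(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(stripScriptBlocks($html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Made-up sample imitating a script that writes a second DOCTYPE.
$sample = <<<'HTML'
<!DOCTYPE html>
<html><head><title>Sample</title>
<script>document.write('<!DOCTYPE html><html><head></head>');</script>
</head><body><h1>Example title</h1></body></html>
HTML;

$doc = loadWithoutScripts($sample);
echo $doc->getElementsByTagName('h1')->item(0)->textContent, PHP_EOL;
```

The cleaned string could also be handed to Symfony's DomCrawler with `new Crawler($cleanedHtml)` instead of DOMDocument, assuming the page body is fetched separately as a string.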
@gmask DiDOM handles this page fine. If you just want to scrape IMDb, you can try it.
For those facing this issue, I suggest using RoachPHP with the ExecuteJavascriptMiddleware, which requires Browsershot but works well.