
The content of some pages is not fully loaded

Open Nicopiwi opened this issue 3 years ago • 2 comments

https://github.com/micahcochran/scrape-schema-recipe/blob/73ea461ad0c1ed256e742a0f285eac36f3db15df/scrape_schema_recipe/scrape.py#L323

Take, for example, https://www.tudiscovery.com/foodnetwork/tiffani-te-invita/recetas/pollo-asado-con-glaseado-de-naranja which renders its microformat elements only after its JavaScript has loaded.
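One way to confirm that a page behaves this way is to check whether the HTML as downloaded (before any JavaScript runs) already carries schema.org Recipe markup. The following is a minimal standard-library sketch, not part of scrape-schema-recipe; the names `RecipeMarkupFinder` and `has_static_recipe_markup` are hypothetical, and the JSON-LD check is a plain substring heuristic.

```python
from html.parser import HTMLParser


class RecipeMarkupFinder(HTMLParser):
    """Records whether static HTML carries schema.org Recipe markup."""

    def __init__(self):
        super().__init__()
        self.found = False
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Microdata style: itemtype="https://schema.org/Recipe"
        itemtype = attrs.get("itemtype") or ""
        if itemtype.rstrip("/").endswith("schema.org/Recipe"):
            self.found = True
        # JSON-LD style: <script type="application/ld+json"> ... </script>
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            # Heuristic: look for a quoted "Recipe" type in the JSON-LD text.
            if '"Recipe"' in "".join(self._buffer):
                self.found = True


def has_static_recipe_markup(html: str) -> bool:
    """Return True if the un-rendered HTML already contains Recipe markup."""
    finder = RecipeMarkupFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

If this returns False for the downloaded HTML while the recipe is visible in a browser, the markup is most likely injected by JavaScript.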

I suggest loading the page's content as described here: https://stackoverflow.com/questions/47730671/python-3-using-requests-does-not-get-the-full-content-of-a-web-page

Nicopiwi, Jan 05 '22 04:01

Thanks for reporting this.

I've seen several approaches to JavaScript-protected content, many of which I haven't had much success with. I've looked into this for https://github.com/hhursev/recipe-scrapers/issues/193 and https://github.com/hhursev/recipe-scrapers/issues/447

I can't get the example website to load on my computer; it redirects me to https://www.discoveryenespanol.com/ (I am in the US). I assume that it wants me to sign in to an account to get the content, which could also be causing problems for scrape-schema-recipe. Since I can't access the example page, it will be difficult for me to fix this.

I also tried dryscrape, which I've had success with in the past on JavaScript pages, and it did not work on this example either:

InvalidResponseError: {"class":"InvalidResponseError","message":"Unable to load URL: https://www.tudiscovery.com/foodnetwork/tiffani-te-invita/recetas/pollo-asado-con-glaseado-de-naranja because of error loading https://www.googletagmanager.com/ns.html?id=GTM-N8NH9FD&gtm_auth=&gtm_preview=&gtm_cookies_win=x: Unknown error"}

I suggest using the requests.get function to download the example page and then running it through Selenium or some other JavaScript engine. scrape-schema-recipe will accept HTML through the scrape() and loads() functions. If you make progress or run into issues, please feel free to post about it.
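A minimal sketch of that workflow, under several assumptions: Selenium 4 and a Chrome browser with a matching driver are installed, a fixed sleep is an acceptable (crude) way to wait for scripts, and the browser fetches the URL itself rather than receiving a page downloaded with requests.get. The helper names `render_with_selenium` and `scrape_rendered` are hypothetical; only `scrape_schema_recipe.loads()` comes from the library.

```python
import time


def render_with_selenium(url, wait_seconds=5.0):
    """Return the page's HTML after its JavaScript has had time to run."""
    # Imported lazily so Selenium stays an optional dependency.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait; an explicit wait would be more robust
        return driver.page_source
    finally:
        driver.quit()


def scrape_rendered(url, render=render_with_selenium, parse=None):
    """Render a page with `render`, then parse the resulting HTML string."""
    if parse is None:
        import scrape_schema_recipe

        parse = scrape_schema_recipe.loads
    return parse(render(url))
```

Calling `scrape_rendered(url)` with the defaults should return whatever `loads()` finds in the rendered HTML; the `render` and `parse` parameters exist only so another JavaScript engine or parser can be swapped in.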

I would happily accept a code example in the README.md file showing how to read recipe schema data that is hidden behind JavaScript.

The problem with adding Selenium or another JavaScript engine is that it becomes a new dependency for everyone. I would only accept a Pull Request that adds a JavaScript engine if it were an optional feature.

If it turns out this is just authentication related, I'd happily accept a PR that expands what can be passed to requests.get so that it can log in to websites (the auth parameter?).
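For reference, this is roughly what the auth route looks like when done outside the library today. It is only a sketch: `fetch_with_auth` is a hypothetical helper, and a `(username, password)` tuple passed as `auth` gives HTTP Basic authentication, which will not help with sites that use a login form or cookies.

```python
def fetch_with_auth(url, username, password, get=None, timeout=30):
    """Download a page using HTTP Basic auth and return its HTML text."""
    if get is None:
        # Imported lazily; `get` can be replaced for testing.
        import requests

        get = requests.get
    response = get(url, auth=(username, password), timeout=timeout)
    response.raise_for_status()  # fail loudly on 401/403 and other HTTP errors
    return response.text
```

The returned string can then be passed to `scrape_schema_recipe.loads()`.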

At this point, I am pondering removing extruct as a dependency. extruct has more to it than what this library really needs.

micahcochran, Jan 06 '22 00:01

What would work best is downloading the webpage, putting it through a JavaScript-interpreting library to get a string of HTML content, and passing that HTML string into the loads() function. If anyone figures out this workflow, please feel free to share so it can be documented.

requests-html seems promising.
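An untested sketch of how that could look with requests-html, assuming the package is installed and that it may download a Chromium build the first time `render()` is called. The helper name `render_with_requests_html` is hypothetical, and the `session` parameter exists only so a session can be reused or replaced.

```python
def render_with_requests_html(url, session=None, timeout=20.0):
    """Fetch a page, run its JavaScript, and return the rendered HTML string."""
    owns_session = session is None
    if owns_session:
        # Imported lazily so requests-html stays an optional dependency.
        from requests_html import HTMLSession

        session = HTMLSession()
    try:
        response = session.get(url)
        response.html.render(timeout=timeout)  # executes the page's JavaScript
        return response.html.html  # rendered markup as a string
    finally:
        if owns_session:
            session.close()
```

The string it returns is what would be passed into `loads()`.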

micahcochran, Jul 22 '22 12:07