
Can't render JavaScript in requests-html / Can't run multithreading in Pyppeteer

Open chipswithdrips opened this issue 5 years ago • 13 comments

Hi,

I'm trying to render JavaScript from webpages, but requests-html fails every time to do it.

This is my code:

    from requests_html import HTMLSession

    s = HTMLSession()
    r = s.get('https://httpbin.org')
    r.html.render()
    print(r.html.html)

Some important points:

- Searching the output with CTRL+F for the version number shows which page I got: 0.9.2 is displayed without JavaScript and 0.9.3 with JavaScript. It always shows 0.9.2.
- Searching for the keyword "cookie" gives "0 matches" (even when typing only "cook"), because that keyword only appears once the JavaScript has been rendered.

It prints only the HTML from before the JavaScript is executed. I've tried giving render() a longer timeout:

    r.html.render(timeout=60)

But it still waits the default 8 seconds.

When I try:

    r.html.render(sleep=60)

it waits those 60 seconds and then does nothing; worse, it reports that the connection has been lost.

I thought it might not be rendering the JavaScript because the request had no headers, so I added Chrome's (first only the user-agent, then all the headers shown in Chrome's network tab when accessing httpbin.org), but still with no success.

I've tried rendering the JavaScript with Pyppeteer directly, which is included in the requests-html library, and it does render the JavaScript (I don't understand why requests-html can't, since it uses the same library). The only downside is that I have to scrape lots of links, and I couldn't find a way to run multiple instances of Pyppeteer.
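For scraping many links, one common pattern is to drive Pyppeteer from a single asyncio event loop and cap the number of pages rendering at once with a semaphore, rather than starting several separate instances. The following is only a minimal sketch under that assumption, not a confirmed fix for this issue: render_all, fetch_rendered_html, main, and the limit of 3 are hypothetical names and values, while pyppeteer.launch, browser.newPage, page.goto, page.content, and page.close are existing Pyppeteer calls.

```python
import asyncio


async def fetch_rendered_html(browser, url):
    """Open a tab, load the URL, and return the DOM after scripts have run."""
    page = await browser.newPage()
    try:
        await page.goto(url)
        return await page.content()
    finally:
        await page.close()


async def render_all(urls, fetch, limit=3):
    """Run fetch(url) for every URL, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def main(urls):
    import pyppeteer  # imported lazily so the helpers above work without it

    browser = await pyppeteer.launch()
    try:
        # One shared browser, one tab per URL (an assumption of this sketch).
        return await render_all(urls, lambda u: fetch_rendered_html(browser, u))
    finally:
        await browser.close()
```

On Python 3.7+ this could be started with asyncio.run(main(urls)); on 3.6, asyncio.get_event_loop().run_until_complete(main(urls)) plays the same role. Sharing one browser keeps a single Chromium process alive, which may also help with the resource cost of launching Chromium once per URL.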

By the way, I'm using PyCharm on Windows 10 with Python 3.6.1 (3.6 throws an error regarding a 'Deque' thing that can't be imported) / 3.7; maybe this info helps in solving the issue.

I've tried to be as detailed as possible with the problems I'm facing right now and I hope I can get the solutions I'm looking for.

Thanks in advance!

P.S. Chromium is downloaded and it shows in task manager when running the render() function (same happens when running the Pyppeteer code).

chipswithdrips avatar Oct 28 '19 02:10 chipswithdrips

I've had the same issue and have been searching for a solution for quite some time.

rodcox89 avatar Oct 28 '19 16:10 rodcox89

I understand your situation, because I've been searching for a solution for a few weeks too, and I don't know how long it will take until we get a proper answer on this issue.

chipswithdrips avatar Oct 28 '19 16:10 chipswithdrips

So I'm guessing that this project is abandoned.

chipswithdrips avatar Nov 07 '19 09:11 chipswithdrips

yeah.... same. I found a better solution. I switched over to Splash Lua Docker HTTP API and couldn't be more pleased with the results.

rodcox89 avatar Nov 07 '19 22:11 rodcox89

Same problem here. Even after changing asyncio.get_event_loop() to asyncio.new_event_loop() it still fails, with the message "signal only works in main thread". It can't be used from multiple threads.
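The "signal only works in main thread" message is consistent with Chromium being launched with signal handlers installed, which Python only permits from the main thread. A sketch of one possible workaround follows, assuming the browser is launched through Pyppeteer directly: handleSIGINT, handleSIGTERM, and handleSIGHUP are existing pyppeteer.launch options, while run_in_worker_thread and render_in_thread are hypothetical helper names. It has not been verified against this issue.

```python
import asyncio
import threading


def run_in_worker_thread(coro_factory):
    """Run coro_factory() to completion on a fresh event loop in a new thread."""
    outcome = {}

    def worker():
        # Each thread needs its own loop; the main thread's loop is not shared.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            outcome["value"] = loop.run_until_complete(coro_factory())
        except Exception as exc:
            outcome["error"] = exc
        finally:
            loop.close()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def render_in_thread(url):
    import pyppeteer  # imported lazily so run_in_worker_thread works without it

    # Skip signal handler installation, which is only allowed in the main thread.
    browser = await pyppeteer.launch(
        handleSIGINT=False,
        handleSIGTERM=False,
        handleSIGHUP=False,
    )
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.content()
    finally:
        await browser.close()
```

A call would look like run_in_worker_thread(lambda: render_in_thread('https://httpbin.org')). The helper joins its thread immediately to keep the sketch short; real multithreaded code would start several such threads before joining them.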

wxtt522 avatar Nov 20 '19 08:11 wxtt522

+1. I don't know whether this is a bug or the project is unmaintained...

BruceLee569 avatar Mar 15 '20 11:03 BruceLee569

Here is my workaround

tingwei628 avatar Mar 16 '20 14:03 tingwei628

pyppeteer is a little heavy on resources and slow. Is there any other library, like aiohttp or requests, that can render a JavaScript page and has async support? requests_html is not working at all, and running pyppeteer with async is heavy on system resources and also takes quite a long time: I passed 10 URLs with async and it took more than a minute to render a JavaScript website and give the result.

Luciferianism avatar Apr 24 '20 10:04 Luciferianism

was there any solution?

awb715 avatar Jan 02 '21 23:01 awb715

How do I set the timeout for rendering JavaScript? This is my code:

    def get_response(self, url):
        session = HTMLSession()
        res = session.request(method='get', url=url, headers=self.headers, timeout=5)

        try:
            print('creating directory to append temporary file')
            os.makedirs('redfin_com_temporary')
        except FileExistsError:
            print('directory created')

        # create response temporary file
        f = open('redfin_com_temporary/res.html', 'w+')
        f.write(res.text)
        f.close()

        # status code
        print(f'Site Status Code: {res.status_code}')
        return res.html.render()

I get this error:

pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 8000 ms exceeded.

Can anybody help me?
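The 8000 ms in this error matches the default 8-second render timeout mentioned earlier in the thread, and render() accepts a timeout keyword in seconds, as in the r.html.render(timeout=60) call from the original post. Below is a hedged sketch of retrying with progressively longer timeouts. render_with_longer_timeouts, example, and the timeout values are hypothetical; the exception class is taken from the traceback above. Whether a longer timeout is enough for this particular site is not something the thread confirms.

```python
def render_with_longer_timeouts(render, timeouts=(8, 20, 60), retry_on=(Exception,)):
    """Call render(timeout=t) for each t in turn until one call succeeds."""
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for seconds in timeouts:
        try:
            return render(timeout=seconds)
        except retry_on as exc:
            last_error = exc  # remember the failure and try a longer timeout
    raise last_error


def example(url):
    # Lazy imports: the helper above does not depend on these packages.
    from pyppeteer.errors import TimeoutError as NavigationTimeout
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    render_with_longer_timeouts(response.html.render, retry_on=(NavigationTimeout,))
    # render() updates the HTML in place, so read it from response.html
    # instead of using render()'s return value.
    return response.html.html
```

Note that the posted get_response returns the result of render() itself, which is None unless a script argument is passed; the rendered markup is read from res.html.html after the call, as in the first post's print(r.html.html).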

perymerdeka avatar Jan 27 '21 05:01 perymerdeka

My solution:

1. Find the browser() function in requests_html.py:

    # $python\Lib\site-packages\requests_html.py
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
        return self._browser

2. Change the headless value:

    headless=False

3. Then, when render() runs, it opens a Chromium window and renders successfully.

ryankolter avatar Mar 12 '21 03:03 ryankolter

This worked for me after countless other things didn't. Thanks!

koljaoh avatar Mar 18 '21 07:03 koljaoh

in _cleanup_tmp_user_data_dir raise IOError('Unable to remove Temporary User Data')

pangzhilei avatar Jan 16 '23 18:01 pangzhilei