
Allocation failed - JavaScript heap out of memory

Open phongtnit opened this issue 2 years ago • 10 comments

Hi,

This issue is related to #18.

The error still occurred with scrapy-playwright 0.0.4. The Scrapy script crawled about 2,500 of the 10k domains from the Majestic list and then crashed with the JavaScript heap out of memory error, so I think this is a bug.

My main code:

domain = self.get_domain(url=url)

context_name = domain.replace('.', '_')
yield scrapy.Request(
    url=url,
    meta={
        "playwright": True,
        "playwright_page_coroutines": {
            "screenshot": PageCoroutine("screenshot", domain + ".png"),
        },
        # Create a new context per domain
        "playwright_context": context_name,
    },
)

My env:

Python 3.8.10
Scrapy 2.5.0
playwright 1.12.1
scrapy-playwright 0.0.4

The detail of error:

2021-07-17 14:47:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.costco.com/>: HTTP status code is not handled or not allowed
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 5: 0xd54755  [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 8: 0xd2be90 v8::internal::Handle<v8::internal::FixedArray> v8::internal::Factory::NewFixedArrayWithMap<v8::internal::FixedArray>(v8::internal::RootIndex, int, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
 9: 0xf5abd0 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Allocate(v8::internal::Isolate*, int, v8::internal::AllocationType) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
10: 0xf5ac81 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Rehash(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>, int) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
11: 0xf5b2cb v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::EnsureGrowable(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
12: 0x1051b38 v8::internal::Runtime_MapGrow(int, unsigned long*, v8::internal::Isolate*) [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
13: 0x140a8f9  [/home/ubuntu/.local/lib/python3.8/site-packages/playwright/driver/node]
Aborted (core dumped)
2021-07-17 14:48:34 [scrapy.extensions.logstats] INFO: Crawled 2533 pages (at 15 pages/min), scraped 2362 items (at 12 items/min)

Temporary fix: I replaced line 166 in handler.py with await page.context.close() to close the current context, because my script uses one context per domain. This fixed the Allocation failed - JavaScript heap out of memory error and the Scrapy script crawled all 10k domains, but the success rate dropped to about 72%, compared with about 85% without the added code. Also, with the new code in place, a new error appeared:

2021-07-17 15:04:59 [scrapy.core.scraper] ERROR: Error downloading <GET http://usatoday.com>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
    extracted = result.result()
  File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 138, in _download_request
    result = await self._download_request_with_page(request, page)
  File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 149, in _download_request_with_page
    response = await page.goto(request.url)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 6006, in goto
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 429, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 117, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Navigation failed because page was closed!

...

2021-07-17 19:31:15 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-38926' coro=<Route.continue_() done, defined at /home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py:544> exception=Error('Target page, context or browser has been closed')>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 582, in continue_
    await self._async(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_network.py", line 207, in continue_
    await self._channel.send("continue", cast(Any, overrides))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed

....

2021-07-18 03:51:34 [scrapy.core.scraper] ERROR: Error downloading <GET http://bbc.co.uk>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 824, in adapt
    extracted = result.result()
  File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 138, in _download_request
    result = await self._download_request_with_page(request, page)
  File "/home/ubuntu/python/scrapy-playwright/scrapy_playwright/handler.py", line 165, in _download_request_with_page
    body = (await page.content()).encode("utf8")
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/async_api/_generated.py", line 5914, in content
    await self._async("page.content", self._impl_obj.content())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_page.py", line 412, in content
    return await self._main_frame.content()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_frame.py", line 325, in content
    return await self._channel.send("content")
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/playwright/_impl/_connection.py", line 54, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Execution context was destroyed, most likely because of a navigation.

phongtnit avatar Jul 17 '21 15:07 phongtnit

There is no need to patch the handler code: closing a context can be done using the existing API. I understand it might seem a bit verbose, but I don't want to create a whole DSL around context/page creation and deletion.

The new error occurs because you're trying to download pages with an already closed context, which makes sense if you're closing the context immediately after downloading each page. It's hard to say without knowing exactly what self.get_domain returns (I suppose something involving urllib.parse.urlparse(url).netloc, but I'm just guessing), but I suspect you might have some URLs in your list that correspond to the same domain(s).

I think you could probably get good performance by grouping URLs in batches (say, 1k per context) and closing each context after that, but that might be too complex. A quick solution to download one response per domain with non-clashing contexts would be to pass a uuid.uuid4() value as the context name for each URL. Given that the underlying Allocation failed - JavaScript heap out of memory seems to be an upstream issue, I don't see much else we can do on this side to prevent it.
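
For illustration, a minimal sketch of the uuid-based suggestion (the spider skeleton, the start_urls placeholder and the get_domain guess are illustrative, not taken from the original report):

import uuid
from urllib.parse import urlparse

import scrapy
from scrapy_playwright.page import PageCoroutine


class DomainsSpider(scrapy.Spider):
    name = "domains"
    start_urls = ["https://example.com"]  # stand-in for the 10k-domain list

    def get_domain(self, url: str) -> str:
        # Guessing at the original helper: the host part of the URL.
        return urlparse(url).netloc

    def start_requests(self):
        for url in self.start_urls:
            domain = self.get_domain(url=url)
            yield scrapy.Request(
                url=url,
                meta={
                    "playwright": True,
                    "playwright_page_coroutines": {
                        "screenshot": PageCoroutine("screenshot", path=domain + ".png"),
                    },
                    # A random, non-clashing context name per request instead of
                    # one long-lived context per domain.
                    "playwright_context": str(uuid.uuid4()),
                },
            )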

elacuesta avatar Jul 19 '21 13:07 elacuesta

Hmm, I got the same error after a few hours when scraping just a single domain. Could it be related to #15, which pops up a fair bit? Is there any way I can increase the memory heap?

Context '1': new page created, page count is 1 (1 for all contexts)
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 5: 0xd54755  [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 8: 0xd2be90 v8::internal::Handle<v8::internal::FixedArray> v8::internal::Factory::NewFixedArrayWithMap<v8::internal::FixedArray>(v8::internal::RootIndex, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 9: 0xf5abd0 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Allocate(v8::internal::Isolate*, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
10: 0xf5ac81 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Rehash(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>, int) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
11: 0xf5b2cb v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::EnsureGrowable(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
12: 0x1051b38 v8::internal::Runtime_MapGrow(int, unsigned long*, v8::internal::Isolate*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
13: 0x140a8f9  [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
Aborted
Crawled 4171 pages (at 8 pages/min), scraped 114799 items (at 186 items/min)

xanrag avatar Jul 25 '21 07:07 xanrag

Are you using a single context for this domain? If so, you're falling into https://github.com/microsoft/playwright/issues/6319.

This seems like an issue on the Node.js side of things. I'm no JS developer, so take the following with a grain of salt, but from what I've found you should be able to increase the memory limit by setting NODE_OPTIONS=--max-old-space-size=<size> as an environment variable.
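
For example (a sketch, not an official scrapy-playwright setting): since the node driver is spawned from the Python process and inherits its environment, the variable could also be set from Python before the first Playwright request, e.g. near the top of the Scrapy project's settings.py:

# settings.py (or any module imported before the browser is launched)
import os

# Raise V8's old-space heap limit to 8 GB for the node driver that
# Playwright spawns; the subprocess inherits this environment variable.
os.environ.setdefault("NODE_OPTIONS", "--max-old-space-size=8192")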

Sources and further reading:

  • https://github.com/npm/npm/issues/12238#issuecomment-367147962
  • https://medium.com/the-node-js-collection/node-options-has-landed-in-8-x-5fba57af703d
  • https://nodejs.org/dist/latest-v8.x/docs/api/cli.html#cli_node_options_options
  • https://nodejs.org/api/cli.html#cli_max_old_space_size_size_in_megabytes

elacuesta avatar Jul 25 '21 14:07 elacuesta

Thank you, setting NODE_OPTIONS seems to have solved the memory issue and it can run for 24h+ without crashing in a single context.

xanrag avatar Aug 03 '21 15:08 xanrag

Thank you, setting NODE_OPTIONS seems to have solved the memory issue and it can run for 24h+ without crashing in a single context.

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

phongtnit avatar Aug 05 '21 14:08 phongtnit

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

Just the memory setting, I added this to my docker-compose and it seems to work:

environment:
  - NODE_OPTIONS=--max-old-space-size=8192

xanrag avatar Aug 05 '21 15:08 xanrag

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

Just the memory setting, I added this to my docker-compose and it seems to work:

environment:
  - NODE_OPTIONS=--max-old-space-size=8192

Thanks @xanrag, I will try to test my script with the new env setting.

phongtnit avatar Aug 06 '21 03:08 phongtnit

@xanrag Hi, did you get the "Aborted (core dumped)" error again?

I added export NODE_OPTIONS=--max-old-space-size=8192 to my ~/.profile file and ran the Scrapy script. However, the Aborted (core dumped) error still occurs when Scrapy Playwright has crawled more than 10k urls, sometimes around 100k urls.

phongtnit avatar Sep 08 '21 10:09 phongtnit

@phongtnit
I ran into this issue too, so I created a context per page and closed the page and context together, like you did. But then I hit a different issue, a Chrome process fork error after more than 7000 pages. I am looking into it now.

@xanrag Hi, did you get the "Aborted (core dumped)" error again?

I added export NODE_OPTIONS=--max-old-space-size=8192 to my ~/.profile file and ran the Scrapy script. However, the Aborted (core dumped) error still occurs when Scrapy Playwright has crawled more than 10k urls, sometimes around 100k urls.

hi-time avatar Sep 09 '21 02:09 hi-time

@xanrag Hi, did you get the "Aborted (core dumped)" error again?

I'm not sure. When I run Scrapy in Celery as a separate process, it doesn't log to the file when it crashes. There is still something going on though: occasionally it stops and keeps putting out the same page/item count indefinitely without finishing, and I have another issue where it doesn't kill the Chrome processes correctly. I'll investigate more and open another issue for that if I find anything. (A week of use spawned a quarter of a million zombie processes...)

xanrag avatar Sep 14 '21 16:09 xanrag

9fe18b5e9363ed87afca04eb3dda8bf2679ef938

elacuesta avatar Feb 07 '23 20:02 elacuesta

@elacuesta hey, I'm having this problem where my computer starts freezing after 1-2 hours of running my crawler. I'm pretty sure it's due to the Playwright issue you linked (https://github.com/microsoft/playwright/issues/6319), where it keeps taking up more and more memory. It seems like a workaround is to recreate the page every x minutes, but I'm not sure how to do this.

I'm already doing all playwright requests with playwright_context="new" and that doesn't fix it.

I'm new to this, can you give me pointers on how I can create a new page or context (?) every x minutes? I'm currently unable to figure this out from the documentation on my own.

I've added my spider in case you're interested

spider
import logging
from typing import Optional

import bs4
import scrapy
from scrapy_playwright.page import PageMethod

from jobscraper import storage
from jobscraper.items import CybercodersJob

class CybercodersSpider(scrapy.Spider):
    name = 'cybercoders'
    allowed_domains = ['cybercoders.com']

    loading_delay = 2500

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'}
    request_meta = dict(
        playwright=True,
        playwright_context="new",
        # You can define page actions (https://playwright.dev/python/docs/api/class-page)
        playwright_page_methods=[
            PageMethod("wait_for_timeout", loading_delay)
            # TODO instead of waiting, wait for the page to load (look for a specific element)
        ]
    )

    def get_search_url(self, page: Optional[int] = 1) -> str:
        page_string = f"page={page}&" if page else ""
        return f"https://www.cybercoders.com/jobs/?{page_string}&worklocationtypeid=3"

    def start_requests(self):
        yield scrapy.http.Request(
            self.get_search_url(),
            headers=self.headers,
            cb_kwargs={'page': 1},
            meta=self.request_meta,
            callback=self.parse
        )

    def parse(self, response, **kwargs):
        """
        Parses the job-search page
        """

        # get all job_links
        job_links = response.css('div.job-title a::attr(href)').getall()

        # If there are no job links on the page, the page is empty so we can stop
        if not job_links:
            return

        # Go to the next search page
        yield scrapy.http.Request(
            self.get_search_url(kwargs['page'] + 1),
            headers=self.headers,
            cb_kwargs={'page': kwargs['page'] + 1},
            meta=self.request_meta,
            callback=self.parse
        )

        # Go to each job page
        for link in job_links:
            job_id = link.split('/')[-1]
            if job_id and storage.has_job_been_scraped(CybercodersJob, job_id):
                continue
            yield response.follow("https://www.cybercoders.com" + link, callback=self.parse_job, headers=self.headers,
                                  meta=self.request_meta)

    def parse_job(self, response, **kwargs):
        """
        Parses a job page
        """

        try:
            soup = bs4.BeautifulSoup(response.body, 'html.parser')

            details = dict(
                id=response.url.split('/')[-1],
                url=response.url,
                description=soup.find('div', class_='job-details-content').find('div',
                                                                                class_='job-details') if soup.find(
                    'div', class_='job-details-content') else None,
                title=response.css('div.job-title h1::text').get() if response.css('div.job-title h1::text') else None,
                skills=response.css('div.skills span.skill-name::text').getall() if response.css(
                    'div.skills span.skill-name::text') else None,
                location=response.css('div.job-info-main div.location span::text').get() if response.css(
                    'div.job-info-main div.location span::text') else None,
                compensation=response.css('div.job-info-main div.wage span::text').get() if response.css(
                    'div.job-info-main div.wage span::text') else None,
                posted_date=response.css('div.job-info-main div.posted span::text').get() if response.css(
                    'div.job-info-main div.posted span::text') else None,
            )

            for key in ['title', 'description', 'url']:
                if details[key] is None:
                    logging.warning(f"Missing value for {key} in {response.url}")

            yield CybercodersJob(
                **details
            )

        except Exception as e:
            logging.error(f"Something went wrong parsing {response.url}: {e}")

Stijn-B avatar Feb 09 '23 14:02 Stijn-B

Passing playwright_context="new" for all requests will not make a new context for each request, it will only make all requests go through a single context named "new". I'd recommend generating randomly named contexts, maybe using random or uuid. That said, one context per request is probably too much; perhaps a good middle point would be one context for each listing page and its derived links, i.e. use the same context for the response.follow calls but generate a new one for the requests that increment the listing page number.
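
A rough sketch of that middle ground, adapting the parse method from the spider above (untested; it only shows where the context names would change, and uuid would need to be imported at the top of the module):

def parse(self, response, **kwargs):
    """
    Parses the job-search page, giving each listing page its own context.
    """
    job_links = response.css('div.job-title a::attr(href)').getall()
    if not job_links:
        return

    # Fresh, randomly named context for the next listing page.
    next_page_meta = dict(self.request_meta,
                          playwright_context=str(uuid.uuid4()))
    yield scrapy.http.Request(
        self.get_search_url(kwargs['page'] + 1),
        headers=self.headers,
        cb_kwargs={'page': kwargs['page'] + 1},
        meta=next_page_meta,
        callback=self.parse,
    )

    # Reuse the context of the listing page that produced these job links.
    link_meta = dict(self.request_meta,
                     playwright_context=response.meta.get("playwright_context"))
    for link in job_links:
        yield response.follow("https://www.cybercoders.com" + link,
                              callback=self.parse_job,
                              headers=self.headers,
                              meta=link_meta)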

elacuesta avatar Feb 09 '23 17:02 elacuesta

@elacuesta Oh ok, good idea. Thanks! After looking online, I'm not 100% sure whether I have to close a context manually or whether just using a new playwright_context="new-name" is enough. If I have to close it manually, can you point me to the documentation about this?

Stijn-B avatar Feb 09 '23 17:02 Stijn-B

If I have to close it manually, can you point me to the documentation about this?

https://github.com/scrapy-plugins/scrapy-playwright#closing-a-context-during-a-crawl
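
Condensed from that README section, the pattern looks roughly like this (a sketch; the spider, context name and URL are only illustrative):

import scrapy


class ContextClosingSpider(scrapy.Spider):
    name = "context_closing"
    start_urls = ["https://example.org"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_context": "short-lived",
                    # Expose the page object so the callback can reach its context.
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response, **kwargs):
        page = response.meta["playwright_page"]
        # Close the page and then its context once the response is in hand,
        # releasing the browser resources tied to that context.
        await page.close()
        await page.context.close()
        yield {"url": response.url}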

elacuesta avatar Feb 09 '23 20:02 elacuesta