
Do not spawn a GenServer for every worker run

rhnonose opened this pull request 7 years ago • 31 comments

The worker is spawned by OPQ's WorkerSupervisor, which is a GenStage ConsumerSupervisor. Apparently it doesn't dispose of its children after the demand is met, so the GenServer started for each event hangs around, inflating the process count.

This PR removes that and runs the worker directly.
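
Conceptually, the change is to stop starting a long-lived GenServer per event and instead run the work in a process that exits when it finishes. A minimal sketch of the idea (module and function names are illustrative, not the PR's exact diff):

    # Before: each event starts a GenServer that lingers after the work is done.
    {:ok, pid} = GenServer.start_link(Crawler.Worker, opts)
    GenServer.cast(pid, :crawl)

    # After: run the work directly in a short-lived process that
    # terminates as soon as the function returns.
    Task.start_link(fn -> Crawler.Worker.run(opts) end)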

Testing with this modification, memory consumption didn't surpass 500 MB, whereas before it would fill up my machine's memory.

Options used:

  workers: 10,
  interval: 500,
  timeout: 5000,
  max_depths: 5,
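
For context, those options are passed straight to the crawl call, roughly like this (the URL is just an example):

    Crawler.crawl("https://example.com",
      workers: 10,
      interval: 500,
      timeout: 5000,
      max_depths: 5
    )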

Solves #9

rhnonose avatar May 23 '18 19:05 rhnonose

Coverage Status

Coverage decreased (-0.3%) to 98.837% when pulling e0ba256cb4f81e6edfa405febcbc5741ac007f27 on rhnonose:master into fedd5a0c27b345d1db5061304c9b7fdcdfea887c on fredwu:master.

coveralls avatar May 23 '18 19:05 coveralls

Hmm interesting, nice find! I do have one question though: what happens if a worker dies? Would the entire process stop?

fredwu avatar May 23 '18 23:05 fredwu

@fredwu it's supervised by WorkerSupervisor, which is a ConsumerSupervisor with a one_for_one strategy. The restart type is temporary, so a dead worker won't be restarted, but that's on OPQ's side.
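
For reference, a minimal sketch (not OPQ's actual module) of a ConsumerSupervisor declaring temporary children, which is why a dead worker simply isn't restarted:

    defmodule MyApp.WorkerSupervisor do
      use ConsumerSupervisor

      def start_link(arg) do
        ConsumerSupervisor.start_link(__MODULE__, arg, name: __MODULE__)
      end

      def init(_arg) do
        # restart: :temporary means a worker that dies is simply gone;
        # the supervisor itself (and the rest of the pipeline) keeps running.
        children = [
          %{
            id: MyApp.Worker,
            start: {MyApp.Worker, :start_link, []},
            restart: :temporary
          }
        ]

        ConsumerSupervisor.init(children,
          strategy: :one_for_one,
          subscribe_to: [MyApp.Producer]
        )
      end
    end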

rhnonose avatar May 23 '18 23:05 rhnonose

Thanks @rhnonose! I'll need to fix up the test suite (broken due to a newer version of Elixir, unrelated to this PR) and after that I'll merge this one in.

fredwu avatar May 24 '18 00:05 fredwu

thanks so much @rhnonose!

I ran my crawler with your fork and I don't see it actually crawling pages. Is it working for you locally?

robvolk avatar May 24 '18 03:05 robvolk

@robvolk it seems like it, but I'll take a closer look

rhnonose avatar May 24 '18 03:05 rhnonose

@robvolk I had to increase the number of workers and depth a little bit to get similar numbers of results as before (workers: 30, max_depth: 5), but other than that it's working fine.

rhnonose avatar May 24 '18 14:05 rhnonose

When I run the crawler with your change, it starts crawling, but then stops pretty quickly. I can't figure out why. I feel like it's now actually calling the parser, but I don't know for sure why it's stopping.

robvolk avatar May 24 '18 23:05 robvolk

@robvolk send me an e-mail with how you're running the crawler, maybe I can help.

rhnonose avatar May 25 '18 02:05 rhnonose

The test suite is now fixed on master. Any more info on @robvolk's issue on this branch?

fredwu avatar Jun 30 '18 06:06 fredwu

@fredwu nothing yet. After merging with master, it still breaks 3 tests. I'll take a look and fix it.

rhnonose avatar Jun 30 '18 14:06 rhnonose

@rhnonose's changes work! Thanks so much for fixing the crawler!! Now I'm able to crawl deep into a site and my server doesn't run out of memory.

Had to use both of these to get it to work:

{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true},

robvolk avatar Jul 03 '18 04:07 robvolk

@robvolk is master fixed yet, or do you still have to use @rhnonose's branch to avoid running out of memory on large sites?

mazz avatar Aug 15 '18 13:08 mazz

@mazz you can try using the latest version of opq, which addresses some similar issues. The PR was merged, but I'm not sure if there's a patched release on Hex.

rhnonose avatar Aug 15 '18 21:08 rhnonose

@rhnonose ok I'll just try

{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true},

mazz avatar Aug 16 '18 00:08 mazz

In addition to the latest crawler at master, I'm using the latest OPQ.


robvolk avatar Aug 16 '18 14:08 robvolk

@robvolk what does your mix.exs look like? When I use

      {:crawler, "~> 1.0"},
      {:opq, "~> 3.1"}

I see a deps conflict:

Failed to use "opq" (version 3.1.0) because
  crawler (version 1.0.0) requires ~> 2.0
  mix.exs specifies ~> 3.1
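
Until crawler relaxes its opq requirement, the usual workaround for this kind of conflict is override: true on the newer dep. A sketch, assuming opq 3.x is actually compatible with what crawler calls:

    defp deps do
      [
        {:crawler, "~> 1.0"},
        # override: true tells Mix to ignore crawler's ~> 2.0 requirement;
        # this only works if the opq 3.x API is still compatible.
        {:opq, "~> 3.1", override: true}
      ]
    end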

Another issue: when I was using

      {:crawler, github: "rhnonose/crawler"},
      {:opq, github: "rhnonose/opq", override: true}

and would call

    iex(1)> Crawler.crawl("http://elixir-lang.org", max_depths: 2)

I would never get a successful fetch:

14:12:49.613 [debug] "Fetch failed 'not_fetched_yet?', with opts: %{assets: [], content_type: \"text/html\", depth: 1, encode_uri: false, headers: [{\"Server\", \"GitHub.com\"}, {\"Content-Type\", \"text/html; charset=utf-8\"}, {\"Last-Modified\", \"Tue, 14 Aug 2018 11:12:02 GMT\"}, {\"Access-Control-Allow-Origin\", \"*\"}, {\"Expires\", \"Thu, 16 Aug 2018 02:22:07 GMT\"}, {\"Cache-Control\", \"max-age=600\"}, {\"X-GitHub-Request-Id\", \"C724:518F:1E94E89:2891DED:5B74DD74\"}, {\"Content-Length\", \"20094\"}, {\"Accept-Ranges\", \"bytes\"}, {\"Date\", \"Thu, 16 Aug 2018 18:12:49 GMT\"}, {\"Via\", \"1.1 varnish\"}, {\"Age\", \"0\"}, {\"Connection\", \"keep-alive\"}, {\"X-Served-By\", \"cache-cmh8822-CMH\"}, {\"X-Cache\", \"HIT\"}, {\"X-Cache-Hits\", \"1\"}, {\"X-Timer\", \"S1534443169.476741,VS0,VE17\"}, {\"Vary\", \"Accept-Encoding\"}, {\"X-Fastly-Request-ID\", \"193a22bc2516efc319c0777a5146f158340e81f8\"}], html_tag: \"a\", interval: 0, max_depths: 2, modifier: Crawler.Fetcher.Modifier, parser: Crawler.Parser, queue: #PID<0.207.0>, referrer_url: \"http://elixir-lang.org\", retrier: Crawler.Fetcher.Retrier, save_to: nil, scraper: Crawler.Scraper, timeout: 5000, url: \"http://elixir-lang.org\", url_filter: Crawler.Fetcher.UrlFilter, user_agent: \"Crawler/1.0.0 (https://github.com/fredwu/crawler)\", workers: 10}."

mazz avatar Aug 16 '18 21:08 mazz

Here's my mix.exs @mazz. I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for crawler.

      {:crawler, git: "https://github.com/fredwu/crawler.git"},
      {:opq, github: "rhnonose/opq", override: true},

robvolk avatar Aug 18 '18 13:08 robvolk

I think worker.ex doesn't need use GenServer anymore
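
A rough sketch of what dropping use GenServer could look like (illustrative, not the actual worker.ex):

    defmodule Crawler.Worker do
      # No `use GenServer` needed: the ConsumerSupervisor only needs
      # start_link/1 to return {:ok, pid} for each event.
      def start_link(opts) do
        Task.start_link(fn ->
          # do the fetch/parse work, then let the process exit
          do_work(opts)
        end)
      end

      defp do_work(opts) do
        # fetching and parsing would happen here
        opts
      end
    end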

Awlexus avatar Aug 18 '18 21:08 Awlexus

I did some more digging and if I use this branch, then it does indeed fix the memory issue. It introduces a new issue though - URLs will be crawled over and over and over.

For example, URLs in a page's header will get scraped and crawled over and over. I'm not sure why. It looks like they might not be getting marked as processed anymore in the Store. Any ideas?
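
The invariant in question is first-seen deduplication: mark a URL the moment it's enqueued and skip anything already marked. A generic sketch of that guard (not Crawler's actual Store API):

    defmodule DedupStore do
      # ETS set keyed by URL; insert_new/2 is atomic, so concurrent
      # workers can't both claim the same URL.
      def init do
        :ets.new(:seen_urls, [:set, :public, :named_table])
      end

      # Returns true only the first time a given URL is seen.
      def first_visit?(url) do
        :ets.insert_new(:seen_urls, {url, :processed})
      end
    end

If links are only enqueued when a check like first_visit?/1 returns true, repeated header links get filtered out; the symptom described above suggests the marking step is being bypassed somewhere.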

robvolk avatar Sep 30 '18 22:09 robvolk

I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for crawler.

This is no longer needed, as the changes in @rhnonose's fork have been merged into the main repo.

I did some more digging and if I use this branch, then it does indeed fix the memory issue. It introduces a new issue though - URLs will be crawled over and over and over.

For example, URLs in a page's header will get scraped and crawled over and over. I'm not sure why. It looks like they might not be getting marked as processed anymore in the Store. Any ideas?

I'd love others to help out on this issue; as I'm not using Crawler actively any more, it's difficult for me to spend enough time and energy on debugging it. I'm happy to incorporate fixes and/or enhancements once they're proven to work. :)

fredwu avatar Oct 01 '18 05:10 fredwu

Thanks for the reply. If @rhnonose 's changes were merged in, why am I seeing different behavior when I reference his fork and yours? When I reference your main branch, the memory climbs linearly, but when I reference his, it grows to a point, then stays flat (as resources are released).

robvolk avatar Oct 01 '18 15:10 robvolk

There have been a bunch of changes since my fork, so it might be something new. Also, the opq it uses is not the latest.

rhnonose avatar Oct 01 '18 17:10 rhnonose

With your new changes, have you noticed pages being scraped multiple times?


robvolk avatar Oct 01 '18 17:10 robvolk

I haven't used the crawler in a while, unfortunately.

rhnonose avatar Oct 01 '18 17:10 rhnonose

Where did things settle here? I'm in need of an Elixir crawler and am willing to devote the bandwidth needed to resolve this issue, but want to make sure I'm starting with proper assumptions.

adamdotdevin avatar Jun 25 '19 11:06 adamdotdevin

@adamelmore Hmm, I don't believe there's been any progress on this...

fredwu avatar Jun 25 '19 11:06 fredwu

Was there consensus on the approach taken in this PR, and we just need to iron out the kinks (pages being scraped multiple times)? Or should I approach #9 from scratch?

adamdotdevin avatar Jun 25 '19 11:06 adamdotdevin

@adamelmore we ran our crawler without issues a couple of times. If you want to check the exact dependencies: https://github.com/emcasa/hammock. I had to use forks, and since we're not using it anymore, I didn't bother updating them later when the PRs got merged.

rhnonose avatar Jun 25 '19 12:06 rhnonose

Is that a private repo, maybe? Getting a 404.

adamdotdevin avatar Jun 25 '19 13:06 adamdotdevin