
Large site issue

Open bartleman opened this issue 7 years ago • 6 comments

I increased Node's memory limit to 32 GB so I could crawl a site with 2+ million links. When it eventually finished, it only generated files 27-44, without displaying any errors.
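
For reference, I bumped the heap with Node's old-space flag, something like this (the exact invocation may differ depending on how the CLI is installed):

node --max-old-space-size=32768 node_modules/.bin/sitemap-generator https://example.com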

bartleman avatar Jan 08 '18 17:01 bartleman

So files 1-26 are missing? Memory should not be a problem if you are using v6; the sitemap data is streamed to files, which should not consume much memory.
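
Roughly, the idea is something like this (an illustrative sketch, not the actual implementation): each discovered URL is appended to a write stream as it comes in, so the full URL list never has to sit in memory.

const fs = require('fs')

// Append each discovered URL to the sitemap file as it arrives,
// instead of collecting millions of URLs in an in-memory array.
const out = fs.createWriteStream('sitemap-part-1.xml', { flags: 'a' })

function onUrlFound(url) {
  out.write(`  <url><loc>${url}</loc></url>\n`)
}

// ...and when the crawl is done:
// out.end('</urlset>\n')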

lgraubner avatar Jan 09 '18 09:01 lgraubner

Yes, those files are missing. A full test run takes almost two weeks, unless there's a way to increase the number of connections?

bartleman avatar Jan 09 '18 14:01 bartleman

Currently it's at five requests/s. This option could be exposed easily. Hopefully I will have some spare time this weekend to check why files are missing.

lgraubner avatar Jan 12 '18 22:01 lgraubner

I could not reproduce the issue. I tried to generate more sitemap files by lowering the maximum number of URLs per file (see the corresponding option), but it seems to work fine. Maybe there is indeed a difference when the files are bigger. Anyway, I suspect it might have to do with some async iteration, so I changed it to serial execution instead.
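
Roughly, the change looks like this (illustrative only, with made-up helper names, not the actual diff): instead of starting all file writes at once, each write is awaited before the next one begins.

// Before: all writes started concurrently, completion order not guaranteed.
// await Promise.all(chunks.map(chunk => writeSitemapFile(chunk)))

// After: serial execution, one sitemap file at a time.
async function writeAll(chunks, writeSitemapFile) {
  for (const chunk of chunks) {
    await writeSitemapFile(chunk)
  }
}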

Any chance you could check out the branch linked above and test? It's not the CLI, but you can test it easily with the following code:

const SitemapGenerator = require('./lib')

// maxConcurrency sets how many crawler workers run in parallel
const gen = SitemapGenerator('https://example.com', {
  maxConcurrency: 20
})

gen.start()

Simply run it with node index.js.

I also exposed the maxConcurrency option which specifies the number of workers used.
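
Extending the snippet above, something like this should tell you when the crawl has finished, so you can check whether all files were written (this assumes the branch keeps the done and error events from the published library; adjust if the API differs):

const SitemapGenerator = require('./lib')

const gen = SitemapGenerator('https://example.com', {
  maxConcurrency: 20
})

// Assumed events: 'done' fires when crawling and writing have finished,
// 'error' fires for URLs that could not be fetched.
gen.on('done', () => {
  console.log('Crawl finished, check the generated sitemap files.')
})

gen.on('error', (error) => {
  console.error('Failed URL:', error)
})

gen.start()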

lgraubner avatar Jan 13 '18 23:01 lgraubner

Testing with maxConcurrency set to 100, it appears to run at the same rate; the crawl doesn't get any faster.

I also get a "JavaScript heap out of memory" error after about an hour of running unless I increase the heap size. I'm currently using: --max_old_space_size=16384

bartleman avatar Jan 16 '18 18:01 bartleman

I think it would be quite handy if you exposed max concurrency as a CLI flag.
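
Something along these lines on the CLI side would cover it (a hypothetical sketch; the flag name, parsing, and default are assumptions, not the actual sitemap-generator-cli code):

const SitemapGenerator = require('sitemap-generator')

// Hypothetical usage: node cli.js https://example.com --max-concurrency 20
const args = process.argv.slice(2)
const url = args[0]
const flagIndex = args.indexOf('--max-concurrency')
const maxConcurrency = flagIndex !== -1 ? parseInt(args[flagIndex + 1], 10) : 5 // default value is an assumption

const generator = SitemapGenerator(url, { maxConcurrency })
generator.start()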

RayBB avatar Mar 27 '18 05:03 RayBB