web-scraper-chrome-extension

Multiple startUrls apparently not working / stop the startUrls pagination on a condition

Open antonio24073 opened this issue 4 years ago • 4 comments

Hi,

Good job with the plugin.

Chrome: Version 87.0.4280.141 (Official Build) (64-bit) Ubuntu 20.04.1 LTS 64-bit

But I'm trying to use this:

Supported URL patterns:
1. Numeric with optional step and zero padding – [START_END:STEP] – [001_010:10]

my sitemap:

{"_id":"google","startUrls":["http://google.com.br?id=[001_010:10]"],"selectors":[{"id":"body","selector":"body","type":"SelectorHTML","parentSelectors":["_root"]}]}

and the pagination does not work.

[Attachments: three screenshots, log.log]

I also tried with 3.6, and it still does not work.

I would also like the loop to stop paginating when a condition is met, such as repeated elements or the HTML containing some given content.

Thank you.

antonio24073, Jan 19 '21 15:01

There may be a misunderstanding of the documentation: the URL pattern should be written as [START-END:STEP], for example [001-010:10], with a hyphen rather than an underscore. See https://github.com/ispras/web-scraper-chrome-extension/blob/master/docs/Scraping%20a%20site.md

Yatskov, Jan 19 '21 16:01

It seems there is a mistake in the plugin hints. I will fix it.

Yatskov, Jan 19 '21 16:01

I think in your case you should use this sitemap as an example:

{"_id":"google","startUrls":["http://google.com.br?id=[001-010:1]"],"selectors":[{"id":"body","selector":"body","type":"SelectorHTML","parentSelectors":["_root"]}]}

It will then make requests to: https://www.google.com.br/?id=010 ... https://www.google.com.br/?id=001
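To illustrate how a [START-END:STEP] pattern expands, here is a minimal JavaScript sketch. It is not the extension's actual getStartUrls() code: the function name expandStartUrl is hypothetical, and both the ascending order of the results and the rule of padding to the width of START are assumptions made for the example.

```javascript
// Hypothetical helper: expands one "[START-END:STEP]" range in a start URL.
// This is an illustration only, not the extension's real implementation.
function expandStartUrl(startUrl) {
  // prefix, START, END, optional ":STEP", suffix
  const match = startUrl.match(/^(.*?)\[(\d+)-(\d+)(?::(\d+))?\](.*)$/);
  if (!match) {
    return [startUrl]; // no range pattern found, keep the URL unchanged
  }
  const [, prefix, startStr, endStr, stepStr, suffix] = match;
  const start = parseInt(startStr, 10);
  const end = parseInt(endStr, 10);
  const step = stepStr === undefined ? 1 : parseInt(stepStr, 10);
  if (step < 1) {
    throw new Error("STEP must be a positive integer");
  }
  // Assumption: zero-pad every number to the width of START (e.g. "001" -> 3).
  const width = startStr.length;
  const urls = [];
  for (let i = start; i <= end; i += step) {
    urls.push(prefix + String(i).padStart(width, "0") + suffix);
  }
  return urls;
}
```

Under these assumptions, expandStartUrl("http://google.com.br?id=[001-010:1]") yields ten URLs, from id=001 to id=010. With [001-010:10] it yields only id=001, because the next value (11) is already past END. With the underscore form [001_010:10] the pattern is not recognized at all, so the URL comes back unchanged.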

Yatskov, Jan 19 '21 16:01

It works!!! :) I only understood it once I looked at the big regex in getStartUrls() in Sitemap.js... Now everything is clear, haha. Please add my feature idea to the queue (I would like the loop to stop paginating when a condition is met, such as repeated elements or the HTML containing some given content). Thank you very much!!!

antonio24073, Jan 19 '21 16:01

Hi again. Same problem again. I had forgotten.

In Edit metadata there is a help hint:

1. Numeric with optional step and zero padding – [START_END:STEP] – [001_010:10]

but the correct form is:

1. Numeric with optional step and zero padding – [START-END:STEP] – [001-010:10]

with - not _

Please replace it.

Thank you

antonio24073, Oct 19 '23 19:10