
crawlUrlfilter

Open mustaszewski opened this issue 5 years ago • 0 comments

Thank you for developing this very useful package. However, I have a problem with the `crawlUrlfilter` argument. From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, `crawlUrlfilter` does exactly what I am looking for.

When the pattern passed to `crawlUrlfilter` contains only one level of the URL path, as in the following call:

```r
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")
```

I get the desired results, i.e. only those URLs that match the pattern "/article/", e.g.

https://www.somewebsite.org/article/sample-article-217 or https://www.somewebsite.org/article/2019-01-20-another-example

However, when I want to filter URLs based on a pattern spanning two levels of the URL path, such as:

https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or https://www.somewebsite.org/article/news/review-of-meetup

the following call does not find any matches:

```r
Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")
```

Is this a bug, or am I misunderstanding something? Judging by the example given in the documentation, `dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/"`, passing a pattern that contains several "/" characters should be no problem at all.
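For what it's worth, here is a minimal sanity check in base R (no Rcrawler needed, using the hypothetical example URLs from above). It shows that the two-level pattern is a valid regular expression and does match the intended URLs, which suggests that if Rcrawler returns no results, the problem may lie in how the filter is applied while traversing links rather than in the pattern itself:

```r
# Hypothetical URLs modelled on the examples above
urls <- c(
  "https://www.somewebsite.org/article/sample-article-217",
  "https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers",
  "https://www.somewebsite.org/article/news/review-of-meetup"
)

# One-level pattern: matches all three URLs
grepl("/article/", urls)
#> TRUE TRUE TRUE

# Two-level pattern: matches the two "news" URLs, as intended
grepl("/article/news", urls)
#> FALSE TRUE TRUE

# The documentation's own multi-slash example is also a valid regex
grepl("/[0-9]{4}/[0-9]{2}/[0-9]{2}/", "https://example.org/2019/01/20/post/")
#> TRUE
```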

mustaszewski commented Mar 25 '19 14:03