PersianCrawler
Open source crawler for Persian websites.
Crawler
Websites crawled so far:
Asriran
asriran/run_asriran.sh
You can change some parameters of this crawler; see run_asriran.sh.
Fa-Wikipedia
Due to some problems during crawling, I split this job into two stages: first crawl all index pages, then use those pages for the actual crawl.
wikipedia/run_wikipedia.sh
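The two-stage idea can be sketched roughly as follows. This is an illustration only, not the code in this repository: the names LinkCollector, collect_article_links and crawl_articles are hypothetical, the "/wiki/" prefix filter is an assumption about which links count as pages, and the fetch function is passed in by the caller so the sketch runs without network access.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_article_links(index_pages: Iterable[str]) -> List[str]:
    """Stage 1: gather unique page links from already-downloaded index pages."""
    seen = set()
    ordered: List[str] = []
    for html in index_pages:
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            # Assumption: only links starting with /wiki/ are pages to crawl.
            if link.startswith("/wiki/") and link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered


def crawl_articles(links: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Stage 2: fetch every link found in stage 1 with the supplied fetch function."""
    return {link: fetch(link) for link in links}
```

Keeping the stages separate means the list of links from stage 1 can be saved and stage 2 can be restarted from it if a run fails partway.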
Tasnim News
This crawler saves Tasnim News pages by category. The result is appropriate for text classification tasks because the data is relatively balanced across categories: I selected an equal number of pages per category.
A parameter called Number_of_pages in tasnim.py controls how many pages are crawled in each category.
tasnim/run_tasnim.sh
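As a rough sketch of how a per-category page limit keeps the data balanced, the snippet below plans the same number of listing pages for every category. It is not the logic of tasnim.py: plan_crawl and the category names are hypothetical, and only the role of the page-count parameter is taken from the description above.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple


def plan_crawl(categories: Sequence[str], number_of_pages: int) -> List[Tuple[str, int]]:
    """Return (category, page_number) jobs, with the same page count per category."""
    if number_of_pages < 0:
        raise ValueError("number_of_pages must be non-negative")
    # Page numbers start at 1; every category gets pages 1..number_of_pages.
    return list(product(categories, range(1, number_of_pages + 1)))


# Hypothetical category names, used only for illustration.
jobs = plan_crawl(["politics", "sports"], number_of_pages=3)
pages_per_category = Counter(category for category, _ in jobs)
```

Here pages_per_category maps every category to the same count, which is what makes the resulting dataset roughly balanced.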
Datasets are all available for download at Kaggle.
CSS selectors are mostly extracted via Copy Css Selector.
- https://stackoverflow.com/questions/73859249/attributeerror-module-openssl-ssl-has-no-attribute-sslv3-method
- https://stackoverflow.com/a/73867925/4201765