python-sitemap
Mini website crawler to make sitemap from a website.
This package seems quite popular and would benefit from being on [PyPI](https://pypi.org/). We could check out [Poetry](https://github.com/python-poetry/poetry) to keep it simple. I can take a look at doing this one...
If the URL contains non-ASCII (Unicode) characters, Python reports an error. Debug info: > INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html > DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal no...
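A minimal sketch of one possible fix (not the library's actual code): percent-encode the non-ASCII parts of a URL before requesting it, so the `'ascii' codec` error never occurs. The helper name `encode_url` is illustrative.

```python
from urllib.parse import urlsplit, urlunsplit, quote

def encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # '%' is kept safe so already-encoded URLs are not double-encoded
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(encode_url("https://gvo.wiki/html/NPC掉落書籍.html"))
# → https://gvo.wiki/html/NPC%E6%8E%89%E8%90%BD%E6%9B%B8%E7%B1%8D.html
```

The CJK characters become `%XX` escapes that any ASCII-only layer can handle, while plain-ASCII URLs pass through unchanged.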
I got this error:

```
python3 main.py --domain https://domain.com --output sitemap.xml
Traceback (most recent call last):
  File "main.py", line 60, in
    crawl.run()
  File "/root/python-sitemap/crawler.py", line 127, in run
    self.__crawl(current_url)
  File "/root/python-sitemap/crawler.py",...
```
Sometimes we have URLs that are canonicalized to other pages, and these should not be included in the sitemap. See Google's reference: https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap So the logic would be to look...
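A sketch of the idea the issue describes, not the library's actual logic: only include a page when its `rel=canonical` link (if any) points back to the page itself. A regex is used here for brevity and assumes `rel` appears before `href`; a real crawler would use an HTML parser.

```python
import re

# Matches <link ... rel="canonical" ... href="..."> and captures the href value
CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def should_include(url: str, html: str) -> bool:
    """Return True if the page has no canonical link or is self-canonical."""
    match = CANONICAL_RE.search(html)
    if match is None:
        return True  # no canonical hint: keep the URL
    return match.group(1).rstrip("/") == url.rstrip("/")

html = '<head><link rel="canonical" href="https://example.com/real-page"/></head>'
should_include("https://example.com/duplicate", html)   # False: canonicalized elsewhere
should_include("https://example.com/real-page", html)   # True: self-canonical
```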
Hi, I am getting a SyntaxError when trying to execute the file, no matter what link I type in. Also, "" and '' don't work. Is there a way to...
Hi, just wanted to say thanks for such a great library. One need we have is to generate a sitemap for a site that has more than 50,000 URLs. The...
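The sitemap protocol caps each file at 50,000 URLs, so one common approach for larger sites (sketched below under assumed file names, not the tool's current behavior) is to split the URLs into chunks, write one sitemap per chunk, and point a sitemap index at them:

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemap protocol

def write_sitemaps(urls, base_url, prefix="sitemap"):
    """Write chunked sitemap files plus an index; return sitemap file names."""
    names = []
    for i in range(0, len(urls), MAX_URLS):
        name = f"{prefix}-{i // MAX_URLS + 1}.xml"
        with open(name, "w", encoding="utf-8") as f:
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in urls[i:i + MAX_URLS]:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
        names.append(name)
    # the index file lists every chunk by its public URL
    with open(f"{prefix}-index.xml", "w", encoding="utf-8") as f:
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for name in names:
            f.write(f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")
    return names
```

Search engines are then given only the index file, and discover the chunked sitemaps from it.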
I have a website with millions of categorized records; it would be useful if I could limit the number of URLs to parse per section, e.g. the first 900,000 URLs...
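A sketch of the per-section limit this issue asks for, assuming "section" means the first path segment of the URL; class and parameter names are illustrative:

```python
from collections import Counter
from urllib.parse import urlsplit

class SectionLimiter:
    """Allow at most `limit_per_section` URLs from each top-level path segment."""

    def __init__(self, limit_per_section: int):
        self.limit = limit_per_section
        self.counts = Counter()

    def allow(self, url: str) -> bool:
        """Return True if this URL's section is still under its quota."""
        segments = urlsplit(url).path.split("/")
        # absolute paths start with "/", so index 1 is the section name
        section = segments[1] if len(segments) > 1 else ""
        if self.counts[section] >= self.limit:
            return False
        self.counts[section] += 1
        return True

lim = SectionLimiter(2)
lim.allow("https://example.com/books/1")  # True
lim.allow("https://example.com/books/2")  # True
lim.allow("https://example.com/books/3")  # False: /books quota reached
lim.allow("https://example.com/music/1")  # True: different section
```

The crawler would call `allow()` before queueing each discovered URL and simply skip URLs whose section is full.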
```
python main.py --domain https://www.domain.com --output sitemap.xml --report
Number of found URL : 1
Number of links crawled : 1
```
We found that some websites considered the scraper too resource-intensive, so I added this configurable rate limiter to reduce the number of requests per time period.
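A minimal sketch of a rate limiter of the kind described: allow at most `max_requests` per `period` seconds using a sliding window, sleeping when the budget is exhausted. Class and parameter names are illustrative, not the PR's actual API.

```python
import time
from collections import deque

class RateLimiter:
    def __init__(self, max_requests: int, period: float):
        self.max_requests = max_requests
        self.period = period
        self.timestamps = deque()  # monotonic times of recent requests

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = time.monotonic()
        # drop timestamps that have aged out of the sliding window
        while self.timestamps and now - self.timestamps[0] >= self.period:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # sleep until the oldest recorded request leaves the window
            time.sleep(self.period - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_requests=5, period=1.0)
# call limiter.wait() immediately before each HTTP request in the crawl loop
```

A sliding window was chosen over a fixed-interval sleep so that short bursts are still allowed up to the configured budget.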
The issue with this tool is that once it halts, you have to start all over again from scratch, and with large sites this is a very common scenario. Since we...
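A sketch of the resumable-crawl idea: periodically persist the frontier and the set of already-crawled URLs to disk, and reload them on startup. The checkpoint file name and JSON format are illustrative, not part of the tool.

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint file

def save_state(to_crawl, crawled, path=STATE_FILE):
    """Atomically write crawl state so a crash mid-write leaves a usable file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"to_crawl": list(to_crawl), "crawled": list(crawled)}, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_state(path=STATE_FILE):
    """Return (to_crawl, crawled); empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return [], set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["to_crawl"], set(state["crawled"])
```

Calling `save_state` every N pages keeps the cost low, and restarting with `load_state` lets the crawl resume from roughly where it stopped instead of from scratch.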