crawley
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
I can't find anything in the documentation about how to use MongoDB to save the crawled data. Am I missing something?
https://docs.python.org/2/library/urlparse.html#urlparse.urljoin provides a robust way to turn a relative URL into an absolute one. This fixes issues like the following: When accessing this URL: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/ We find relative links...
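A minimal sketch of the approach proposed above, assuming Python 3, where `urljoin` lives in `urllib.parse` rather than in the Python 2 `urlparse` module linked in the report. The relative links are hypothetical examples chosen for illustration; they are not taken from the page.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Page URL from the report above.
base_url = "http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/"

# Hypothetical relative links, used only for illustration.
relative_links = ["resultado_busca.html", "../index.html", "/cms/home.html"]

# Resolve each link against the URL of the page it was found on.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

Resolving against the URL of the page that contained the link, rather than against the site root, is what lets `../` and bare file names come out right.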
Any ideas why I'm getting... `ImportError: cannot import name ScopedSession`
I'm using PyQuery, and I get wrong encoding detection for this page: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html The problem is that the HTML has this meta tag: `` But the page is actually `utf-8`...
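One possible workaround, sketched here under assumptions, is to decode the response bytes yourself and try strict UTF-8 before trusting the declared charset. The helper name `decode_html` and the `iso-8859-1` fallback are hypothetical; the charset the page actually declares is not shown in the report.

```python
def decode_html(raw_bytes, declared_encoding="iso-8859-1"):
    """Decode page bytes, preferring UTF-8 over the declared charset.

    `declared_encoding` is an assumed fallback, not the value from the page.
    """
    try:
        # Strict UTF-8 decoding rejects most non-UTF-8 input,
        # so success is a strong hint the page really is UTF-8.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to whatever the page claims to be.
        return raw_bytes.decode(declared_encoding, errors="replace")


# The same text encoded two ways, to exercise both branches.
utf8_page = "<p>associação</p>".encode("utf-8")
latin1_page = "<p>associação</p>".encode("iso-8859-1")

print(decode_html(utf8_page))
print(decode_html(latin1_page))
```

The decoded text can then be handed to the parser instead of the raw bytes, so the misleading meta tag no longer drives the decoding.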
Hi, there are some missing dependencies on the master branch. If I try to use the shell, the following packages are missing: - pymongo - couchdb - PyQt4
I tried to use the shell command to test my XPaths, but it doesn't work. $ crawley shell http://somewebsite.com/index.html Traceback (most recent call last): File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in manage()...
Implement a way to use a regex in the scraper's matching URLs.
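To illustrate what the request above could mean in practice, here is a rough sketch using Python's standard `re` module. The `url_matches` helper and the pattern are hypothetical and do not reflect crawley's actual API; the sample URLs come from the issues listed on this page.

```python
import re


def url_matches(pattern, url):
    """Return True if the regex `pattern` matches anywhere in `url`.

    Hypothetical helper for illustration; not part of crawley.
    """
    return re.search(pattern, url) is not None


# Hypothetical pattern: result pages whose nomeArq parameter is numeric.
pattern = r"/associados/resultado_busca\.html\?nomeArq=\d+\.html$"

result_page = (
    "http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/"
    "resultado_busca.html?nomeArq=0148.html"
)
listing_page = "http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/"

print(url_matches(pattern, result_page))
print(url_matches(pattern, listing_page))
```

Using `re.search` rather than `re.match` lets the pattern describe only the distinguishing part of the URL instead of the whole string.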
Create a crawley project that demonstrates how to use the crawler's login and then scrape data from pages that require a session.
Consider the possibility of making a simple web browser desktop application that allows "end users" to scrape web pages with a GUI. This app should show the webpage to the user and...
Bumps [sqlalchemy](https://github.com/sqlalchemy/sqlalchemy) from 0.7.8 to 1.3.0. Release notes Sourced from sqlalchemy's releases. 1.3.0 Released: March 4, 2019 [feature] [schema] Added new parameters Table.resolve_fks and MetaData.reflect.resolve_fks which when set to False...