lazynlp
Library to scrape and clean web pages to create massive datasets.
Hello. I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help. `read_disallows(url)`: takes a URL and returns the...
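The snippet above is cut off, so the exact behavior of `read_disallows` is unknown. A minimal sketch of what such a helper might do, assuming it collects the `Disallow` paths for a given user agent from robots.txt text (the function name and matching rules here are illustrative, not the contributor's actual code):

```python
def parse_disallows(robots_txt, user_agent="*"):
    """Hypothetical sketch: return the Disallow paths that apply to
    `user_agent` in the given robots.txt text. A real implementation
    would follow the full robots.txt grouping rules."""
    disallows = []
    active = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # A block applies if it names our agent or the wildcard.
            active = value == user_agent or value == "*"
        elif field == "disallow" and active and value:
            disallows.append(value)
    return disallows
```

For a simple robots.txt with a wildcard block followed by a bot-specific block, this returns only the paths from the blocks that match the requested agent.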
Added headers to urllib. Detailed in the issue [here](https://github.com/chiphuyen/lazynlp/issues/11)
Have you considered adding a metric to assess the text quality of the documents, for example using the frequencies of short frequent words? (http://rolandschaefer.net/?p=78)
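The linked post argues that natural running text contains a predictable share of short, high-frequency function words, while boilerplate and keyword lists do not. A toy sketch of that idea (the word list and threshold here are illustrative assumptions, not the metric from the post):

```python
# Hypothetical sample of short, high-frequency English function words.
COMMON_SHORT_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that"}

def short_word_ratio(text):
    """Hypothetical sketch: fraction of tokens that are common short
    function words. Very low values suggest non-running text such as
    navigation menus or tag clouds."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,;:!?") in COMMON_SHORT_WORDS)
    return hits / len(tokens)
```

A document-quality filter could then drop pages whose ratio falls below some empirically chosen cutoff.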
Hi, thanks for this great tool. I noticed urllib fails with a `Forbidden Request` error when I call `download_page` on some links. You can reproduce the error by trying the...
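Many servers return HTTP 403 Forbidden for urllib's default `Python-urllib` User-Agent, which is the likely cause of the error above and what the headers fix in the previous issue addresses. A minimal sketch of sending a browser-like header, assuming nothing about lazynlp's actual `download_page` implementation (the User-Agent string is an arbitrary example):

```python
import urllib.request

def build_request(url):
    """Hypothetical sketch: wrap the URL in a Request that carries a
    browser-like User-Agent, which many servers require."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; lazynlp-crawler)"},
    )

def fetch(url):
    # urlopen accepts a Request object, so the header is sent as-is.
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read().decode("utf-8", errors="ignore")
```

Passing a `Request` object to `urlopen` is the standard-library way to attach headers without switching to a third-party HTTP client.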
One might as well extract structured data from each element of such a dataset. Linked data: https://5stardata.info/ Useful features:
- Relations to e.g. https://schema.org/Dataset (s)
- Reified edges to other...
Get the important (?!) images of the webpages in Markdown style.
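The request above presumably means emitting `![alt](src)` links for images found in a page. A small sketch with the standard library's `html.parser`, assuming no filtering of which images are "important" (that heuristic is left open, as the issue title itself suggests):

```python
from html.parser import HTMLParser

class ImageExtractor(HTMLParser):
    """Hypothetical sketch: collect <img> tags and render each one as a
    Markdown image link, using the alt text as the label."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            src = a.get("src")
            if src:
                self.images.append(f"![{a.get('alt', '')}]({src})")

def images_to_markdown(html):
    parser = ImageExtractor()
    parser.feed(html)
    return parser.images
```

Deciding which images are important (e.g. by size attributes or position in the DOM) would be an extra filtering step on top of this.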
Hello, I am reaching out regarding your source code file crawl.py. After running tests with Pylint, a few errors were found in the source code....
Hello, I am reaching out regarding your Python source code files. After running tests with Pyflakes and Pylint, a few errors were present in the source...
Hello, I am reaching out regarding your Python code. After running tests with Pylint and Pyflakes, there are a few errors concerning variable usage present in the source...