
Library to scrape and clean web pages to create massive datasets.

Results: 10 lazynlp issues, sorted by most recently updated

Hello. I'm new to open source contribution. I saw your issue #6 and created a robots.py file that might help you. `read_disallows(url)` : takes in a url and returns the...
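A minimal sketch of what such a robots.txt helper could look like, assuming `read_disallows(url)` fetches the site's robots.txt and collects its Disallow paths; the parsing details below are illustrative, not the contributor's actual robots.py:

```python
import urllib.error
import urllib.parse
import urllib.request


def read_disallows(url):
    """Fetch the site's robots.txt and return its Disallow paths (sketch)."""
    parts = urllib.parse.urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        with urllib.request.urlopen(robots_url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="ignore")
    except (urllib.error.URLError, ValueError):
        return []
    # Collect every path listed under a Disallow directive.
    disallows = []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                disallows.append(path)
    return disallows
```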

Added headers to urllib. Detailed in the issue [here](https://github.com/chiphuyen/lazynlp/issues/11)
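For context, a hedged sketch of the kind of fix described: sending a browser-like User-Agent header via `urllib.request.Request`, since some servers answer urllib's default agent with HTTP 403. The function name `download_page_with_headers` is illustrative, not lazynlp's actual API:

```python
import urllib.request


def download_page_with_headers(url):
    """Fetch a page with an explicit User-Agent to avoid 403 responses (sketch)."""
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; lazynlp-crawler)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="ignore")
```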

Have you considered adding a metric to assess the text quality of the documents, for example using the frequencies of short frequent words? (http://rolandschaefer.net/?p=78)
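A rough sketch of such a metric, assuming the signal is the share of tokens that belong to a small set of short, frequent function words; the word list and any cutoff are illustrative choices, not taken from lazynlp or the linked article:

```python
# Well-formed natural-language text tends to contain a high share of common
# function words, while boilerplate and junk pages do not.
SHORT_FREQUENT_WORDS = {
    "the", "of", "and", "to", "in", "a", "is", "that", "it", "for",
    "on", "as", "with", "was", "at", "by", "be", "this", "are", "or",
}


def short_word_ratio(text):
    """Return the fraction of tokens that are short, frequent words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.strip(".,!?;:\"'()") in SHORT_FREQUENT_WORDS)
    return hits / len(tokens)
```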

Hi, Thanks for this great tool. I noticed urllib fails with a `Forbidden Request` error when I call `download_page` on some links. You can reproduce the error by trying the...

One might as well extract structured data from each element of such a dataset (linked data, https://5stardata.info/). Useful features: relations to e.g. https://schema.org/Dataset(s); reified edges to other...
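One way such structured data could be pulled from crawled pages is to read schema.org JSON-LD blocks. The sketch below is illustrative (a real implementation would likely use an HTML parser rather than a regex), and `extract_json_ld` is a hypothetical helper name:

```python
import json
import re


def extract_json_ld(html):
    """Pull schema.org JSON-LD blocks out of a page (sketch)."""
    pattern = re.compile(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    records = []
    for block in pattern.findall(html):
        try:
            records.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
    return records
```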

help wanted

Get the important (?) images of the web pages in Markdown style.
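A minimal sketch of what such an image-extraction step might look like, assuming it simply rewrites `<img>` tags as Markdown image syntax; deciding which images are actually important is left out, and `images_to_markdown` is a hypothetical name:

```python
import re


def images_to_markdown(html):
    """Rewrite <img> tags as Markdown image links (sketch)."""
    srcs = re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.IGNORECASE)
    return "\n".join(f"![]({src})" for src in srcs)
```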

Hello, I am reaching out regarding the source code file for your Python code (crawl.py). After running tests with Pylint, a few errors were found in the source code....

Hello, I am reaching out regarding the source code files for your Python code. After running tests with Pyflakes and Pylint, there were a few errors present in the source...

Hello, I am reaching out regarding your Python code. After running tests with Pylint and Pyflakes, there are a few errors concerning variable usage present in the source...