Andrei Paraschiv
**Comment by [agnelvishal](https://github.com/agnelvishal)** _Wed Nov 21 16:07:25 2018_ ---- You can use the http://commoncrawl.org/ API. An example is available at https://github.com/agnelvishal/Condense.press/tree/master/backend/cdx-index-client-master
**Comment by [agnelvishal](https://github.com/agnelvishal)** _Wed Nov 21 16:09:03 2018_ ---- I already have the dataset and am in need of an intern. Can you help?
**Comment by [nanaya07](https://github.com/nanaya07)** _Thu Nov 22 03:14:35 2018_ ---- I am already done with the project. I just used more news websites. Thank you.
**Comment by [yprez](https://github.com/yprez)** _Mon May 30 19:01:24 2016_ ---- There was some code in cleaners.py that cleans up divs with this sort of thing. But in this case, it looks...
**Comment by [ekingery](https://github.com/ekingery)** _Mon Oct 15 21:36:56 2018_ ---- I am in a similar situation, in that I've found various minor issues with the way article content is parsed and...
**Comment by [yprez](https://github.com/yprez)** _Tue May 10 18:31:50 2016_ ---- @tehnar can you provide a specific example?
**Comment by [tehnar](https://github.com/tehnar)** _Tue May 10 18:38:40 2016_ ---- @yprez For example, http://blog.jetbrains.com/ruby/. Newspaper thinks that there are only 94 articles, while the real number is much larger. The latest...
**Comment by [yprez](https://github.com/yprez)** _Wed May 11 10:28:39 2016_ ---- Did you try disabling cache? e.g. `newspaper.build('http://blog.jetbrains.com/ruby/', memoize_articles=False)` I'm getting 127 articles from http://blog.jetbrains.com/ruby/, not sure if it's all of them...
**Comment by [tehnar](https://github.com/tehnar)** _Wed May 11 10:37:52 2016_ ---- @yprez I'm still getting only 95 articles (I disabled caching and removed ~/.newspaper_scraper). There are 32 pages (the last page http://blog.jetbrains.com/ruby/page/32/),...
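One possible workaround for the undercount is to enumerate the paginated listing URLs and feed each one to the crawler separately. The following is a minimal sketch, not part of newspaper: `page_urls` is a hypothetical helper, and it assumes the `/page/N/` pattern holds for every page from 1 to 32 (only page 32 is confirmed in the comment above).

```python
from typing import List
from urllib.parse import urljoin


def page_urls(base_url: str, last_page: int) -> List[str]:
    """Build listing-page URLs of the form <base_url>/page/<n>/ for n = 1..last_page.

    Hypothetical helper: the /page/<n>/ layout is an assumption taken from
    the single example URL in the thread.
    """
    if last_page < 1:
        raise ValueError("last_page must be >= 1")
    # urljoin only appends to the path if the base ends with a slash.
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, f"page/{n}/") for n in range(1, last_page + 1)]


urls = page_urls("http://blog.jetbrains.com/ruby/", 32)
print(len(urls), urls[-1])
```

Each resulting URL could then be passed to `newspaper.build(..., memoize_articles=False)` as in the earlier comment, with the article URLs deduplicated across pages. Whether that recovers all the articles has not been verified here.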
**Comment by [tehnar](https://github.com/tehnar)** _Fri May 13 13:37:09 2016_ ---- Also, publish dates are not extracted properly for all the articles. For example, publish date for this article http://blog.jetbrains.com/ruby/2016/05/rubymine-2016-1-1-security-update/ is not...