newspaper4k

Extending keywords or even OWL rdfs labels for domain specific searches

Open AndyTheFactory opened this issue 2 years ago • 0 comments

Issue by tomthebuzz Wed Jul 18 14:43:41 2018 Originally opened as https://github.com/codelucas/newspaper/issues/596


I like your work a lot and would like to expand on it. Have you ever thought about making it possible to extend or change the specific NLTK sets you use, for domain-specific parsing and extraction? This could likely already be done by extending the NLTK sets in use, but I guess we need a more holistic and structured approach so that it can be used generically for different use cases and purposes.

Happy to pitch in if we find a sufficiently large group of interested folks....
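To make the idea a bit more concrete, here is a minimal sketch of what domain-aware keyword ranking could look like. It is plain Python and does not use newspaper's or NLTK's API; `DOMAIN_TERMS`, `STOPWORDS`, `domain_keywords` and the `boost` parameter are all hypothetical names made up for illustration. The domain vocabulary could just as well be loaded from the rdfs:label values of an OWL ontology.

```python
import re
from collections import Counter

# Hypothetical domain vocabulary (could be filled from OWL rdfs:label values).
DOMAIN_TERMS = {"blockchain", "bitcoin", "ledger", "cryptocurrency"}
# Small illustrative stopword set, not the list any library ships with.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it", "has"}


def domain_keywords(text, domain_terms=DOMAIN_TERMS, stopwords=STOPWORDS,
                    boost=2.0, top_n=5):
    """Rank words by frequency, multiplying counts of domain terms by `boost`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    scored = {w: c * (boost if w in domain_terms else 1.0)
              for w, c in counts.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scored, key=lambda w: (-scored[w], w))[:top_n]
```

Passing a different `domain_terms` or `stopwords` set per use case is the kind of generic extension point I have in mind; the scoring itself is deliberately naive.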

Also, the extracted content still carries messy HTML and line-break (return) characters (see below). Any idea how best to tune the extraction to get rid of them, other than re-parsing the output of course?

Example Output {"CN_Content": [{"id": 17, "content": "Over the past few years, the buzzwords \u201cBlockchain Technology\u201d have been tied to literally every industry under the sun. Cryptocurrencies had a wild year in 2017 and it seems during that time distributed ledger technology has made it all the way to the \u2018Blockchain 5.0\u2019 era, but there\u2019s a big problem \u2014 No one reporting on these projects has tried these networks.\n\nAlso read: Bitcoin: A Legitimate Cure for a Broken Money System\n\nBlockchain 5.0 & DLTs: The Ultimate Snake Oil\n\nBack in 2009, a network was launched which produced the digital currency bitcoin and no one moved a muscle. These days cryptocurrencies are a hell of a lot more popular than those days but there are some people who have this idea that the technology \u201cbehind\u201d digital currencies represents the real innovation. You\u2019ve heard it time and time again, that \u2018Blockchain Technology
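For what it's worth, the `\u201c` and `\n` sequences in the sample above look like JSON string escapes rather than leftover HTML, so decoding the JSON may already remove most of them. A minimal post-processing sketch, assuming the content arrives as a JSON string literal; `clean_content` is a hypothetical helper, not part of newspaper:

```python
import html
import json
import re


def clean_content(raw_json_string):
    """Decode a JSON string literal, unescape HTML entities, tidy whitespace."""
    text = json.loads(raw_json_string)      # decodes \uXXXX escapes and \n
    text = html.unescape(text)              # turns entities such as &amp; into &
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{2,}", "\n\n", text)  # at most one blank line between paragraphs
    return text.strip()
```

This is still a form of re-parsing, but it is cheap and keeps the paragraph breaks that the extractor found.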

AndyTheFactory Oct 24 '23 12:10