go-boilerpipe
JSON and Image Extraction
Hello, I'm wondering whether JSON output and image extraction are on your roadmap. For example, a response like this:
{
  "response": {
    "title": null,
    "content": " \nThe third characteristic of web page writing ....\n ",
    "source": "https://www.york.ac.uk/teaching/cws/wws/webpage4.html",
    "images": [
      {
        "src": "https://www.york.ac.uk/teaching/cws/wws/toofar.gif",
        "width": null,
        "height": null,
        "alt": "some alt"
      }
    ]
  },
  "status": "success"
}
Hi Wayne,
I don't currently plan to add these features, because I don't think they fit the original goal of the Java boilerpipe library, which (as far as I can tell) is only to extract text from HTML documents, even if that means losing images or potential JSON data.
I think you might be better off using the HTML parsing library directly, along with go-boilerpipe if needed.
That said, I'm not completely opposed to adding these features if they're something you need or want. Making them optional (disabled by default) might be OK, to preserve backwards compatibility. Of course, if you open any pull requests I'd be happy to consider them.
Thank you for your question and interest in this library!