go-boilerpipe

JSON and Image Extraction

Open • wayneconnolly opened this issue 5 years ago • 1 comment

Hello, I was wondering whether JSON output and image extraction features are on your roadmap. For example:

{
  "response": {
    "title": null,
    "content": " \nThe third characteristic of web page writing ....\n ",
    "source": "https://www.york.ac.uk/teaching/cws/wws/webpage4.html",
    "images": [
      {
        "src": "https://www.york.ac.uk/teaching/cws/wws/toofar.gif",
        "width": null,
        "height": null,
        "alt": "some alt"
      }
    ]
  },
  "status": "success"
}

wayneconnolly, Aug 18 '19 03:08

Hi Wayne,

I don't currently have plans to add these features, because I don't think they fit well with the original goal of the Java boilerpipe library, which (as far as I can tell) is only to extract text from HTML documents, even if that means losing images or potential JSON data.

I think you might be better off using the HTML parsing library directly, along with go-boilerpipe if needed.

That said, I'm not completely opposed to adding these features if they're something you need or want. Making them optional (disabled by default) would probably be OK, to preserve backwards compatibility. And of course, if you open any pull requests, I'd be happy to consider them.
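To make the "optional, disabled by default" idea concrete, here is one possible shape, purely as a sketch. `Options`, `Result`, and `process` are hypothetical names and do not exist in go-boilerpipe; the point is only that the zero value of the options struct keeps today's text-only behaviour.

```go
package main

import "fmt"

// Options is a hypothetical configuration struct (not go-boilerpipe's API).
// Its zero value leaves every new feature switched off.
type Options struct {
	ExtractImages bool // opt in to collecting image sources
}

// Result is a hypothetical output type: text always, images only on request.
type Result struct {
	Text   string
	Images []string
}

// process stands in for an extraction entry point. The text and image
// sources are passed in directly so the sketch stays self-contained.
func process(text string, imageSrcs []string, opts Options) Result {
	res := Result{Text: text}
	if opts.ExtractImages {
		res.Images = append(res.Images, imageSrcs...)
	}
	return res
}

func main() {
	srcs := []string{"https://www.york.ac.uk/teaching/cws/wws/toofar.gif"}

	// Existing callers pass the zero value and see no change.
	fmt.Printf("%+v\n", process("some text", srcs, Options{}))

	// New callers opt in explicitly.
	fmt.Printf("%+v\n", process("some text", srcs, Options{ExtractImages: true}))
}
```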

Thank you for your question and interest in this library!

jlubawy, Aug 25 '19 07:08