Scrapegraph-ai
blockScraper implementation
Is your feature request related to a problem? Please describe. A scraper pipeline capable of retrieving all the similar blocks on a page, such as the repeated listings on e-commerce, weather, and flight websites.
Describe the solution you'd like I have found this paper: https://www.researchgate.net/publication/261360247_A_Web_Page_Segmentation_Approach_Using_Visual_Semantics It deals specifically with this issue.
Describe alternatives you've considered nope
Additional context
Neat idea, but would it be simpler to just group web elements with the same CSS tags? A computer vision approach seems a bit over-engineered.
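For illustration, here is a minimal sketch of that simpler grouping idea. It reads "same CSS tags" as "same tag name plus same set of class names", which is an assumption, and the names `SignatureGrouper` and `repeated_blocks` are hypothetical, not part of Scrapegraph-ai. It only counts repeated signatures using the standard library; it does not implement the segmentation approach from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SignatureGrouper(HTMLParser):
    """Counts start tags by a (tag, sorted class names) signature."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # A bare `class` attribute has value None, so fall back to "".
        classes = dict(attrs).get("class") or ""
        # Sort the class names so "product card" and "card product" match.
        signature = (tag, tuple(sorted(classes.split())))
        self.groups[signature] += 1


def repeated_blocks(html, min_count=2):
    """Return signatures that carry at least one class and repeat."""
    parser = SignatureGrouper()
    parser.feed(html)
    parser.close()
    return {
        sig: count
        for sig, count in parser.groups.items()
        if count >= min_count and sig[1]  # skip elements without classes
    }


if __name__ == "__main__":
    page = (
        '<div class="product card"><h2 class="title">A</h2></div>'
        '<div class="card product"><h2 class="title">B</h2></div>'
        "<p>footer</p>"
    )
    for (tag, classes), count in repeated_blocks(page).items():
        print(tag, classes, count)
```

A signature like this ignores nesting and visual layout, so it will miss blocks that share structure but not class names. That gap is what page segmentation algorithms try to cover.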
@epage480 This isn't a CV approach; it groups similar objects from the HTML. If you want to help us implement this paper, let me know. Here is the reference: A Web Page Segmentation Approach Using Visual Semantics
@lurenss, I would much rather focus on this paper: An Empirical Comparison of Web Page Segmentation Algorithms (2021).
It is a much more recent and detailed comparison of all major web page segmentation algorithms.
tl;dr: Microsoft's VIPS algorithm is still the best one out there; you can find implementations in Java, JS, or Python.