
blockScraper implementation

Open lurenss opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe. A scraper pipeline capable of retrieving all the similar blocks in a page, as on e-commerce, weather, or flight websites.

Describe the solution you'd like I have found this paper: https://www.researchgate.net/publication/261360247_A_Web_Page_Segmentation_Approach_Using_Visual_Semantics It deals specifically with this issue.

Describe alternatives you've considered nope

Additional context Screenshot 2024-04-27 at 15 04 05

lurenss avatar Apr 27 '24 13:04 lurenss

Neat idea, but wouldn't it be simpler to just group web elements with the same CSS tags? A computer vision approach seems a bit over-engineered.
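A minimal sketch of what this simpler grouping might look like, assuming "same CSS tags" means the same tag name plus the same set of class names. The names `SignatureCollector` and `find_repeated_blocks` and the `min_repeats` threshold are hypothetical and not part of Scrapegraph-ai; only the Python standard library is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SignatureCollector(HTMLParser):
    """Records every start tag under a (tag name, sorted class names) key."""

    def __init__(self):
        super().__init__()
        # signature -> list of (line, column) positions where it occurs
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A class attribute without a value is parsed as None, hence the `or ""`.
        classes = tuple(sorted((attr_map.get("class") or "").split()))
        if classes:  # assumption: elements without classes are too generic to group
            self.groups[(tag, classes)].append(self.getpos())


def find_repeated_blocks(html, min_repeats=3):
    """Return the signatures that occur at least `min_repeats` times in `html`."""
    collector = SignatureCollector()
    collector.feed(html)
    collector.close()
    return {
        signature: positions
        for signature, positions in collector.groups.items()
        if len(positions) >= min_repeats
    }
```

On a product listing, this would surface signatures such as `("li", ("card", "product"))` when that element repeats. It ignores layout and nesting entirely, so it is only a rough baseline next to the segmentation approach in the paper.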

epage480 avatar May 06 '24 12:05 epage480

@epage480 This isn't a CV approach; it groups similar objects from the HTML. If you want to help us implement this paper, let me know. Here is the reference: A Web Page Segmentation Approach Using Visual Semantics

lurenss avatar May 07 '24 07:05 lurenss

@lurenss, I would much rather focus on this paper: An Empirical Comparison of Web Page Segmentation Algorithms (2021).

It is a much more recent and detailed comparison of all major web page segmentation algorithms.

tl;dr: Microsoft's VIPS algorithm is still the best one out there; you can find implementations in Java, JS, or Python.

DiTo97 avatar May 13 '24 00:05 DiTo97